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Abstract 


High-performance  discrete  Fourier  transform  (DFT)  libraries  are  an  important  requirement  for 
many  computing  platforms.  Unfortunately,  developing  and  optimizing  these  libraries  for  mod¬ 
ern,  complex  platforms  has  become  extraordinarily  difficult.  To  make  things  worse,  performance 
often  does  not  port,  thus  requiring  permanent  re-optimizations.  Overcoming  this  problem  has 
been  the  goal  of  Spiral,  a  library  generation  system  that  can  produce  highly  optimal  DFT  code 
from  a  high  level  specification  of  algorithms  and  target  platforms. 

However,  current  techniques  in  Spiral  cannot  support  all  target  platforms.  In  particular,  sev¬ 
eral  emerging  target  platforms  incorporate  a  distributed  memory  parallel  processing  paradigm, 
where  the  cost  of  accessing  non-local  memories  is  relatively  high,  and  handling  data  movement 
is  exposed  to  the  programmers.  Traditionally  used  only  in  supercomputing  environments,  this 
paradigm  is  now  also  finding  its  way  in  the  form  of  multicore  processors  into  desktop  computing. 

The  goal  of  this  work  is  the  computer  generation  of  high-performance  DFT  libraries  for  a 
wide  range  of  distributed  memory  parallel  processing  systems,  given  only  a  high-level  descrip¬ 
tion  of  a  DFT  algorithm  and  some  platform  parameters.  The  key  challenges  include  generat¬ 
ing  code  for  multiple  target  programming  paradigms  that  delivers  load  balanced  parallelization 
across  multiple  layers  of  the  compute  hierarchy,  orchestrates  explicit  memory  management,  and 
overlaps  computation  with  communication. 

We  attack  this  problem  by  first  developing  a  formal  framework  to  describe  parallelization, 
streaming,  and  data  exchange  in  a  domain-specific  declarative  mathematical  language.  Based 
on  this  framework,  we  develop  a  rewriting  system  that  structurally  manipulates  DFT  algorithms 
to  “match”  them  to  a  distributed  memory  target  architecture  and  hence  extracts  maximum  per¬ 
formance.  We  implement  this  approach  as  a  part  of  Spiral  together  with  a  backend  that  trans- 
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lates  the  derived  algorithms  into  actual  libraries.  Experimental  results  show  that  the  perfor¬ 
mance  of  our  generated  code  is  comparable  to  hand-tuned  code  in  all  cases,  and  exceeds  the 
performance  of  hand-tuned  code  in  some  cases. 
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Chapter  1 


Introduction 


Most  compute  intensive  applications  spend  the  bulk  of  their  run  time  in  basic  numerical  func¬ 
tions,  called  kernels.  Examples  of  such  kernels  include  matrix  multiplication,  linear  system 
solvers,  and  linear  transforms.  The  most  important  transform,  and  arguably  one  of  the  most  im¬ 
portant  kernels  overall,  is  the  discrete  Fourier  transform  (DPT).  The  DFT  is  used  by  applications 
as  diverse  as  audio,  image,  and  video  processing,  wireless  communications,  molecular  simula¬ 
tions,  and  differential  equation  solvers.  These  applications  are  executed  on  various  computing 
platforms,  from  large  scale  supercomputers  to  desktop  computers  to  embedded  systems. 

Typically,  software  application  developers  use  libraries,  available  for  kernels  such  as  the  DFT, 
for  two  main  reasons.  First,  they  lower  the  programming  burden  by  allowing  code  reuse.  Sec¬ 
ond,  libraries  can  abstract  away  the  need  for  platform-specific  performance  optimizations:  ap¬ 
plications  can  make  library  calls  while  being  oblivious  to  platform  specihc  details,  while  libraries 
providing  the  same  functionality  can  be  tuned  separately  on  various  platforms  for  performance. 

Since  a  significant  part  of  an  application’s  run  time  is  spent  in  calls  to  libraries,  it  is  critical  for 
such  libraries  to  be  fast.  However,  due  to  the  increasing  complexity  and  diversity  of  hardware  ar¬ 
chitectures,  it  is  common  for  a  DFT  library  that  performs  well  on  one  platform  to  perform  poorly 
on  another.  As  mentioned  above,  it  has  thus  become  increasingly  common  to  tune  libraries  to 
specific  platforms  for  maximal  performance.  For  this  reason,  several  implementations  of  a  li¬ 
brary  may  exist  optimized  for  different  platforms. 

We  illustrate  the  problem  for  the  DFT  on  a  recent  multicore  processor  (the  Cell  BE,  with  9 
cores)  in  Figure  1.1.  The  horizontal  axis  spans  a  range  of  input  problem  sizes,  and  the  vertical  axis 
is  performance  (inverse  run  time)  in  pseudo-giga-fioating  point  operations  per  second  (higher 
is  better).  “Pseudo”  means  that  the  operations  count  is  estimated  as  SNlogN.  Performance  is 
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FFTs  on  a  Cell  processor 

Performance  [pseudo  Gflop/s] 


Figure  1.1:  DFT  performance  on  the  IBM  Cell  BE  (peak  floating  point  performance  is  204.8  Gflop/s).  A 
naive  DFT  implementation  based  on  only  minimizing  the  operations  count  under-performs  by  two  orders 
of  magnitude,  and  even  expert  libraries  do  not  achieve  the  optimal  performance. 


hence  inversely  proportional  to  run  time.  The  plot  shows  the  performance  of  four  difference 
DFT  implementations,  all  based  on  fast  algorithms,  called  fast  Fourier  transforms  (FFTs).  The 
bottom  line,  which  performs  worst  is  the  C  implementation  from  Numerical  Recipes  of  the  fast 
Fourier  transform  based  directly  on  its  mathematical  definition  [Press  et  al.,  1992].  FFTW  [Frigo 
and  Johnson,  2005] ,  a  popular  adaptive  library  with  a  Cell  extension  performs  considerably  bet¬ 
ter,  but  fails  to  achieve  over  10  Gflop/s  for  the  problem  sizes  shown.  FFTC  [Bader  and  Agarwal, 
2007] ,  a  library  written  and  manually  tuned  specifically  for  the  Cell  performs  better,  but  is  still 
suboptimal  on  the  Cell.  The  topmost  line  shows  the  performance  of  code  that  was  automatically 
generated  by  the  work  in  this  thesis.  It  is  two  orders  of  magnitude  faster  than  the  naive  Numerical 
Recipes  implementation,  and  up  to  1.6x  faster  than  the  fastest  hand-tuned  implementation. 

Surprisingly,  all  the  implementations  shown  in  Figure  1.1  have  roughly  the  same  operations 
count,  which  dramatically  illustrates  the  limitations  of  compilers  on  modern  multicore  plat¬ 
forms.  Ideally,  simply  compiling  the  Numerical  Recipes  code  would  yield  the  performance  of  the 
topmost  line.  However,  this  is  likely  to  be  impossible  since  the  most  important  optimizations 
involve  restructuring  the  algorithms  to  match  the  features  of  the  target  platform.  For  distributed 
memory  platforms  like  the  Cell,  this  means  generating  parallelized  code  that  is  load  balanced. 
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code  that  uses  the  interconnect  between  the  cores  efficiently,  and  code  that  overlaps  computa¬ 
tion  with  communication  to  minimize  or  hide  communication  costs. 

Mapping  a  DPT  to  a  complex  platform  like  the  Cell  is  difficult.  There  are  various  FFTs,  each  of 
which  can  be  modified  in  numerous  ways  to  better  “fit”  a  given  target  platform.  General  purpose 
compilers  are  unable  to  perform  this  task  because  expressing  an  FFT  in  an  imperative  language 
like  C  fixes  the  algorithm  being  used,  and  the  compiler  lacks  the  domain  knowledge  to  change  it. 
Hence,  the  optimization  task  is  left  with  the  programmer  who  needs  to  have  a  profound  under¬ 
standing  of  the  algorithm  and  platform  to  adapt  to  its  parallelism,  memory  structure,  and  other 
architectural  details. 

Further  compounding  this  problem,  the  number  of  platforms  is  also  large,  and  they  are  con¬ 
tinuously  updated.  Writing,  tuning,  and  maintaining  separate  DFT  libraries  for  each  of  these 
platforms  is  very  expensive,  and  in  practice,  infeasible.  The  goal  of  this  thesis  is  to  overcome  this 
problem  for  DFT  libraries  targeting  the  important  class  of  parallel  distributed  memory  platforms. 
The  idea,  hinted  at  in  Figure  1.1,  is  to  replace  the  human  programmer  by  a  program  generation 
system  that  produces  optimal  DFT  code  from  a  high-level  specification  and  a  few  platform  pa¬ 
rameters.  Before  we  state  the  goal  in  more  detail,  we  provide  some  background  on  distributed 
memory  platforms  and  state-of-the-art  DFT  libraries  to  put  our  work  in  context. 


1 . 1  Distributed  Memory  Platforms 

Simply  put,  a  distributed  memory  architecture  is  one  in  which  multiple  processors  exist,  each 
with  its  own  private  memory.  Moving  data  between  the  processors,  when  required,  is  relatively 
expensive  relative  to  computation. 

The  distributed  memory  architectural  paradigm  exists  in  both  chip -based  processors  and 
node-based  supercomputers.  As  Figure  1.2(c)  shows,  chip-based  distributed  memory  architec¬ 
tures  like  the  Cell  BE  are  constructed  from  many  single  core  processors  (Figure  1.2(a))  with  as¬ 
sociated  on-chip  local  memories,  connected  via  an  on-chip  interconnect.  Larger  supercomput¬ 
ing  platforms  also  follow  the  distributed  memory  paradigm:  as  Figure  1.2(d)  shows,  platforms 
like  the  Cray  XT5  are  constructed  using  many  multicore  shared  memory  chip  processors  (Fig¬ 
ure  1.2(b))  connected  via  an  external  network  like  InfiniBand. 

To  see  how  the  distributed  memory  paradigm  evolved  and  why  the  paradigm  is  and  will  re¬ 
main  an  important  feature  of  computing  systems,  we  look  at  these  two  broad  classes  of  comput¬ 
ing  platforms — supercomputing  platforms  and  desktop  computing  platforms — in  greater  detail 
below. 

Parallelism  in  supercomputing  platforms.  Supercomputing  platforms  are  designed  to  run 
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Figure  1.2:  Architectures  targeted  by  this  thesis,  (a)  shows  a  single  core  SIMD  CPU.  (b)  shows  a  cache 
based  multicore  CPU  with  shared  cache  and  main  memory,  (c)  shows  a  distributed  memory  scratchpad 
based  multicore  CPU  with  shared  main  memory,  (d)  shows  a  distributed  memory  multi-node  supercom¬ 
puter.  In  this  thesis,  we  target  architectures  of  the  type  shown  in  (c)  and  (d). 


large  scale  compute  intensive  programs.  In  original  supercomputing  designs,  processing  was 
typically  the  bottleneck,  and  the  focus  was  on  designing  faster  scalar  processors.  As  processor 
speeds  scaled  to  their  limits,  the  need  for  ever  higher  compute  power  made  designers  turn  to 
parallelism,  first  in  the  form  of  vector  architectures,  then  in  the  form  of  multi-node  (4-16)  ar¬ 
chitectures,  and  finally,  in  current  massively  parallel  processing  (MPP)  designs  with  over  200,000 
nodes.  This  is  particularly  evident  in  the  ratio  shift  of  distributed  memory  systems  (clusters  and 
MPP  architectures)  from  25%  in  1993  to  over  99%  in  2010  among  the  top  500  fastest  supercom¬ 
puters  [TOP500,  2010]. 

The  need  for  parallelism  stems  primarily  from  the  impracticality  of  scaling  the  amount  of 
compute  power  in  a  single  processor  or  chip  due  the  power  density  problem  and  contention  for 
off-chip  resources  including  memory.  The  distributed  memory  design  overcomes  these  prob¬ 
lems  by  exploiting  the  data  access  locality  of  programs  to  reduce  overall  contention  for  memory, 
which  in  turn  allows  for  scalability. 

On  the  other  hand,  the  main  design  constraint  in  distributed  memory  systems  is  the  slow 


1.1.  Distributed  Memory  Platforms 


5 


speed  of  data  movement  relative  to  the  processing  speed.  This  means  that  the  exchange  of 
data  often  causes  bottlenecks  in  computing.  Another  problem  is  that  application  programs  ex¬ 
ecuting  on  distributed  memory  architectures  must  explicitly  manage  data  movements,  which 
increases  programming  complexity.  Even  with  languages  like  Unified  Parallel  C  (UPC)  that  at¬ 
tempt  to  alleviate  programming  complexity  by  providing  the  application  with  a  single  shared 
partitioned  global  address  space  (PGAS),  the  underlying  hardware  constraints  still  exist,  and 
algorithms  must  be  designed  to  both  minimize  data  movement,  and  minimize  overhead  costs 
by  reading  and  writing  non-local  data  in  large  packets,  in  many  cases,  algorithms  must  be  re¬ 
designed  and  rewritten  at  a  high  level  to  achieve  the  required  characteristics. 

in  summary,  distributed  memory  is  is  an  important  feature  of  supercomputing  systems,  and 
is  expected  to  remain  so  for  the  foreseeable  future.  We  next  look  at  this  paradigm  in  desktop 
computing  platforms. 

Parallelism  in  desktop  platforms.  For  several  decades,  the  growth  in  performance  of  com¬ 
puting  platforms,  and  in  particular  chip  processors,  primarily  came  from  frequency  scaling.  In 
the  last  few  years,  frequency  scaling  has  plateaued  due  to  physical  limitations.  To  keep  increas¬ 
ing  floating  point  performance  in  accordance  with  Moore’s  Law,  architects  have  instead  turned  to 
parallelism.  While  this  allows  for  growth  in  peak  theoretical  floating  point  performance,  it  causes 
new  problems:  programming  complexity,  and  the  increased  contention  for  shared  resources. 

Programming  for  parallel  architectures  is  difficult  by  nature  of  the  problem.  Software  must 
be  written  to  expose  parallelism  to  take  advantage  of  multicore  processors.  In  addition,  software 
may  have  to  take  into  account  the  specifics  of  the  hardware  platform  in  order  to  run  efficiently. 
As  an  example,  false  cache-line  sharing  is  a  problem  introduced  by  multicore  processors  that 
affects  performance.  Software  must  be  aware  of  the  cache-line  size  of  the  processor  to  ensure 
that  false-sharing  is  minimized.  This,  in  combination  with  the  increasing  number  of  different 
chip  processors  leads  to  reduced  performance  portability,  and  further  increases  programming 
burden  by  forcing  developers  to  create  and  maintain  differently  tuned  versions  of  their  software 
for  various  platforms. 

The  scaling  of  parallelism  in  chip  processors  also  leads  to  increased  contention  for  both  on- 
chip  and  off-chip  shared  resources.  Arguably  the  most  important  resource  is  access  to  memory, 
including  both  on-chip  (shared  coherent  caches)  and  off-chip  (main  memory)  types.  Scaling 
the  number  of  cores  on  a  chip  places  an  enormous  burden  on  shared  on-chip  caches,  and  also 
on  the  off-chip  data  bandwidth,  which  does  not  scale  with  the  number  of  cores  per  chip.  This 
further  aggravates  the  existing  problem  of  the  “memory  wall,”  which  is  the  problem  where  data 
intensive  programs  are  unable  use  the  available  processing  power  due  to  inadequate  off-chip 
latencies  and  bandwidths. 
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One  possible  solution  is  the  distributed  cache  approach,  where  cache  memories  are  placed 
physically  close  to  the  processing  core  that  will  primarily  use  them.  This  is  reminiscent  of,  and 
very  similar  to  the  idea  of  distributed  local  memories  used  by  supercomputing  platforms. 

Such  cache  systems  maybe  coherent  (as  in  the  Intel  Larrabee)  and  allow  for  a  shared-memory 
programming  paradigm,  or  non-coherent  (like  in  the  Cell  BE),  and  require  a  distributed  memory 
programming  paradigm.  Regardless  of  the  nature  of  coherency  or  the  programming  paradigm 
supported,  the  underlying  hardware  tradeoffs  are  similar  to  distributed  memory  platforms  in 
supercomputers;  inter-core  data  transfers  are  expensive  as  they  are  subject  to  bandwidth  and 
latency  limitations,  and  also  suffer  overhead  costs.  From  a  performance  perspective,  software 
must  minimize  inter-core  data  transfers,  and  use  algorithms  designed  to  transfer  memory  using 
large  packet  sizes.  In  addition,  applications  must  manage  the  reduced  per-core  memory  band¬ 
width  well.  These  factors  further  add  to  the  programming  burden,  especially  for  performance 
critical  software  like  DPT  libraries. 

The  future  of  the  distrihuted  memory  paradigm.  For  the  reasons  we  have  stated,  the  dis¬ 
tributed  memory  paradigm  is  an  important  template  in  the  architect’s  toolbox  to  design  plat¬ 
forms  that  can  scale  in  performance.  It  is  currently  a  mainstay  of  supercomputing  platforms, 
and  is  expected  to  remain  so  for  the  foreseeable  future.  In  addition,  it  is  rapidly  finding  its  way 
into  desktop  platforms.  Thus,  from  the  perspective  of  writing  performance-critical  libraries,  it  is 
important  to  understand  how  to  develop  software  for  the  distributed  memory  paradigm,  and  in 
addition,  how  to  reduce  the  development  and  optimization  effort. 


1 .2  High-Performance  DFT  Libraries 

As  demonstrated  in  Figure  1.1,  writing  high-performance  DFT  libraries  is  difficult.  There  have 
been  several  approaches  to  achieving  platform-tuned  high  performance  DFTs  in  the  past.  In  this 
section  we  discuss  approaches  for  non- distributed  memory  systems,  followed  by  approaches  for 
distributed  memory  systems.  We  also  illustrate  the  deficiencies  in  current  state-of-the-art  of  DFT 
libraries  for  distributed  memory  systems  that  this  thesis  addresses. 

The  problem.  Computing  the  DFT  using  a  straightforward  implementation  of  an  FFT  that 
only  minimizes  operations  count  does  not  yield  performance  as  Figure  1.1  shows.  This  is  mainly 
because  run  time  on  modern  architectures  is  dominated  by  architectural  features  such  as  par¬ 
allelism  and  the  memory  hierarchy.  This  means  that  the  “structure”  of  an  algorithm  matters 
as  much  as  the  operations  count.  As  mentioned  earlier,  a  general  purpose  compiler  is  unable 
to  perform  the  necessary  optimizations  because  it  lacks  domain  knowledge.  Parallelizing  and 
vectorizing  compilers  are  unsuccessful  for  the  same  reasons.  Below  we  look  at  alternative  ap- 
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proaches  to  tuning  DFT  libraries. 

1.2.1  DFT  Libraries  for  Non- Distributed  Memory  Platforms 

Hand-tuned  libraries.  Several  hand-tuned  DFT  libraries  exist  for  single  core  and  multicore  plat¬ 
forms.  Vendor  libraries  like  Intel’s  IPP  and  MKL  are  hand-tuned  specifically  to  the  vendor’s  hard¬ 
ware  products.  FFTE  [Takahashi,  2002]  is  a  DFT  library  that  includes  support  for  a  range  of  shared 
memory  multicore  processors.  The  GNU  Scientific  Library  (GSL)  [FSF,  2010],  also  manually  writ¬ 
ten,  supports  FFTs  and  is  meant  to  be  used  on  a  wide  variety  of  platforms.  Accordingly,  the  per¬ 
formance  is  suboptimal. 

There  are  several  issues  with  manual  tuning:  first,  it  requires  programmers  with  combined 
algorithm  and  architecture  expertise,  which  is  expensive.  Second,  such  a  manually  tuned  library 
may  not  perform  well  on  other  hardware.  Third,  libraries  may  have  to  be  manually  re-tuned 
whenever  new  platforms  are  released. 

Performance  portable  libraries.  One  successful  approach  to  facilitate  performance  porting 
is  the  use  of  adaptive  libraries  like  FFTW  [Frigo  and  Johnson,  2005]  and  UHFFT  [Mirkovic  and 
Johnsson,  2001],  which  both  support  DFTs  on  multicore  architectures.  In  this  approach,  the  li¬ 
brary  has  built-in  degrees  of  freedom  in  how  to  perform  the  computation,  which  arise  from  the 
different  possible  ways  to  recursively  decompose  the  DFT  into  smaller  DFTs.  A  runtime  plan¬ 
ning  routine  (the  planner]  selects  the  best  decomposition  strategy  based  on  timing  information 
obtained  from  the  platform,  at  library  installation  or  program  initialization  time.  Small  DFTs 
are  computed  using  pre-written  code  (called  codelets  in  FFTW  terminology),  which  are  either 
handwritten,  or  in  the  case  of  FFTW,  generated  using  a  special-purpose  compiler  [genfft] . 

Although  a  degree  of  performance  portability  is  achieved  with  these  libraries,  extensive  work 
or  a  fundamental  redesign  may  be  required  to  allow  these  libraries  to  work  on  new  architec¬ 
tural  features  or  new  programming  paradigms.  An  example  of  this  is  the  UHFFT  library:  though 
it  supports  multicore  architectures,  it  is  unable  to  support  SIMD  (Single  Instruction  Multiple 
Data)  vectorization.  In  addition,  these  libraries  are  not  able  to  easily  exploit  new  algorithms,  and 
cannot  easily  be  extended  to  new  compute  functions. 

Program  generation  and  optimization.  The  approach  taken  by  Spiral  [Piischel  et  al.,  2005] 
is  to  “teach”  computers  how  to  generate  and  optimize  code  for  linear  transforms.  Spiral  works 
by  representing  both  the  algorithm  and  the  architecture  at  a  high  level  of  abstraction  using  Kro- 
necker  (tensor)  products  ([Johnson  et  al.,  1990]),  and  manipulates  the  algorithm  at  this  level  to 
map  it  to  the  hardware. 

This  has  several  advantages  over  adaptive  libraries.  First,  Spiral  can  generate  code  for  any 
transform  that  can  be  represented  using  its  internal  abstraction.  These  may  include  transforms 
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other  than  the  DFT,  or  even  functions  heyond  transforms  [Franchetti  et  al.,  2009a].  Second, 
adding  new  algorithms  for  an  existing  transform  is  comparatively  simple  as  long  as  they  can 
he  represented  using  the  internal  abstraction.  New  algorithms  may  yield  increased  performance 
for  specific  sizes,  or  may  be  useful  in  producing  a  desired  tradeoff  point  between  performance 
and  accuracy.  Third,  Spiral  is  designed  to  be  extensible  to  new  architectural  paradigms  such  as 
multithreading  or  vector  instruction  sets  [Franchetti  et  al.,  2005,  2006a].  Fourth,  Spiral  can  also 
generate  adaptive  libraries  of  the  type  discussed  above  [Voronenko,  2008] .  Finally,  since  Spiral 
works  at  a  high  level  of  abstraction,  lower  level  details  such  as  the  programming  paradigm,  pro¬ 
gramming  memory  model  (e.g.,  distributed  or  shared  memory),  and  platform  specific  instruc¬ 
tions  (e.g.,  DMA,  MPl)  can  be  easily  Included  as  backends.  We  discuss  Spiral  in  more  detail  in 
Section  2.4  and  Section  2.4.4. 

1 .2.2  DFT  Libraries  for  Distributed  Memory  Platforms 

Hand-tuned  libraries.  Hand-tuned  libraries  that  exist  for  distributed  memory  platforms  include 
FFTE  [Takahashi,  2002],  IBM’s  eSSL  [IBM,  2010],  and  also  parallelized  versions  of  GSL  [Aliaga 
et  al.,  2009].  All  these  use  fixed  algorithms,  and  while  they  may  perform  well  on  certain  target 
architectures,  performance  typically  does  not  port  over  to  a  range  of  architectures. 

Performance  portable  libraries.  FFTW  includes  support  for  both  the  Cell  BE  and  for  MPI 
based  systems.  These  are  both  separate  components  in  FETW.  On  most  MPI  based  systems, 
FFTW  performs  reasonably  well  and  is  typically  used  as  the  standard  DFT  library  unless  a  faster, 
vendor  tuned  version  is  available. 

Program  generation  and  optimization.  Although  Spiral  can  generate  code  for  architectures 
with  a  memory  hierarchy,  SIMD  vector  architectures,  and  shared  memory  multicore  processors, 
it  is  currently  unable  to  generate  code  for  distributed  memory  platforms.  [Chen  and  lohnson, 
2004]  presented  a  self-adapting  distributed  memory  package  to  compute  the  Walsh-Hadamard 
transform  (WHT),  which  is  similar  to  the  FFT.  [Bonelli  et  al.,  2006]  added  some  preliminary,  ba¬ 
sic  support  to  Spiral  for  generating  DFT  code  for  MPI  platforms.  However,  this  support  was  for 
only  a  single  parallelized  FFT,  was  not  compatible  with  SIMD  vectorizatlon,  did  not  support  flex¬ 
ible  explicit  data  transfers,  and  did  not  support  any  form  of  memory  streaming  or  overlapping 
computation  with  communication  to  minimize  communication  costs. 

In  summary,  the  problem  of  developing  highly  optimized  DFT  libraries  is  very  well  under¬ 
stood  as  shown  by  Spiral  which  completely  automates  the  process.  However,  this  excludes  the 
important  class  of  parallel  distributed  memory  platforms.  While  the  hand-written  FFTW  pro¬ 
vides  a  good  solution,  its  performance  maybe  suboptimal  (Figure  1.1).  Hence,  it  is  desirable  to 
have  a  flexible  method  to  automatically  generate  DFT  libraries  to  produce  a  long-term  solution 
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Figure  1.3:  Goal  of  this  thesis. 

to  the  optimization  and  porting  problem. 

1.3  Problem  Statement 

Although  the  Spiral  approach  has  been  successful  in  automating  library  development  for  DFTs 
and  other  linear  transforms,  Spiral  is  unable  to  generate  code  for  distributed  memory  platforms. 
We  address  this  deficiency  in  this  thesis,  leveraging,  but  also  considerably  extending  prior  work: 

The  goal  of  this  thesis  is  the  computer  generation  of  high-performance  DFT  libraries 
for  distributed  memory  parallel  processing  systems,  given  only  a  high-level  descrip¬ 
tion  of  DFT  algorithms  and  a  set  of  platform  and  architecture  parameters. 

The  proposed  generator  is  visualized  in  Figure  1.3.  The  problem  specification  is  a  functional  de¬ 
scription  of  the  transform  (we  focus  on  the  DFT)  to  generate  and  optimize  code  for.  This  includes 
transform  parameters  (e.g.,  transform  size  for  fixed  size  code),  code  or  library  specification  (e.g., 
block  cyclic  distributed  data  layout,  and  optimization  for  throughput).  The  algorithm  specifica¬ 
tion,  which  is  a  part  of  our  system’s  “knowledge,”  is  a  mathematical  description  of  various  DFT 
algorithms.  The  platform  parameters  provide  a  small  set  of  relevant  architectural  details  (e.g., 
number  of  processors,  vector  length,  data  transfer  packet  size,  programming  paradigm). 

In  Figure  1.4,  we  show  two  examples  of  libraries  that  we  generate.  In  Figure  1.4(a),  we  specify 
that  we  want  to  generate  fixed-size  code  for  a  DFT  of  size  2^  using  2-8  processors,  optimized 
for  throughput,  for  an  input  vector  stored  in  main  memory,  using  the  processor’s  4-way  single 
precision  SIMD  vectorization.  We  also  specify  the  relevant  platform  parameters  of  the  target 
platfom,  which  is  the  Cell  BE  in  this  case,  including  the  SIMD  ISA  and  the  DMA/pthreads  based 
programming  paradigm,  the  memory  paradigm,  the  size  of  the  local  store,  and  the  minimum 
data  packet  size  to  use  for  communications.  Correspondingly,  the  output  is  a  single  function  in 
C  that  implements  the  DFT  using  Cell  specific  programming  instructions. 
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DFT  2^ 

SPESIMD 

Use  2-8  SPEs 

DMA/pthreads  programming 

Fixed  size  code 

Distributed  local  memory 

Throughput  optimized 

Shared  main  memory 

Memory-to-memory 

LSsize=256kB 

4-way  single  precision  SIMD 

Packet  size=128B 

4  4 


Automatic  program  generation 


4 

void  ixDft_9( complex  *Y,  complex  *Xj  int  speid) 
{  complex  al41,  al42,  t470j 
for  (int  i=0;  i<s;  i++) 

{  DMA_Scatter(Yj  Ybuf,  i-1); 
DMA_Gathen(Xbuf,  X,  i+1); 
t470  =  (Xlocal[0]  +  Xlocal[8]); 

} 

} 


DFT2"  SSESIMD 

Use  2^-2®  PEs  MPI  programming 

General  size  code  Distributed  memory 

Latency  optimized  Packet  size=largest 

Blockdata  distribution 
2-way  double  precision  SIMD 

4  4 


Automatic  program  generation 


4 

void  dft_plan(plan  *p,  int  size); 

void  dft_compute(plan  *p,  complex  *Y,  complex  *X) 

{  switch(p->rulG)  { 

case  1:  RS_2_compute( . . . ) J 

} 

static  void  RS_2_computG( . . . )  {...} 
static  void  RS_3_computG( . . . )  {...} 


(a)  Fixed  size  throughput  optimized  code  for  the  Cell  BE  (b)  General  size  latency  optimized  MPI  library 
Figure  1.4:  Example  libraries  that  can  be  generated  using  our  system  in  Figure  1.3. 


In  Figure  1.4(b),  we  specify  a  different  scenario:  we  want  to  generate  a  latency  optimized 
general  size  library  for  tbe  DFT  that  uses  a  number  of  PEs  specified  at  run  time,  a  block  data 
distribution,  and  tbe  architecture’s  2-way  double  precision  SIMD  vectorization.  Under  platform 
parameters,  we  specify  that  we  want  to  use  an  MPI  based  distributed  programming  model,  Intel 
SSE  SIMD  instructions,  and  the  largest  packet  size  possible  for  any  problem  size.  Tbe  output  of 
our  system  in  tbis  case  is  a  general  size  library  ^  that  is  somewhat  similar  to  FFTW:  it  includes 
a  df  t_plan  routine  to  be  called  at  library  initialization  time  with  a  specified  problem  size.  The 
routine  finds  the  fastest  computation  strategy  for  this  size  and  returns  it  as  a  “plan.”  This  means 
that  the  library  bas  an  adaptation  mecbanism.  The  df  t_conipute  routine  then  uses  this  plan  to 
compute  the  DFT  using  a  set  of  mutually  recursive  functions. 

Though  our  system  can  potentially  provide  a  wide  range  of  functionality  on  many  distributed 
memory  platforms,  we  focus  on  tbe  DFT  and  two  representative  platforms:  tbe  Cell  BE  and  MPI 
based  platforms.  We  capture  the  range  of  functionality  that  is  used  to  evaluate  this  thesis  in 
Table  1.1. 


Wo  generate  general  size  Ubraries  in  this  thesis,  we  use  infrastructure  developed  by  Yevgen  Voronenko  and 
Volodymyr  Arbatov,  including  search  mechanisms  developed  by  Frederic  de  Mesmay. 


1.4.  Thesis  Contributions 
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Feature 

Cell 

MPI 

Code  type 

Fixed  input  size  DFT 

General  input  size  DFT 

Memory  paradigm 

Distributed  local  memory,  shared  main 
memory 

Distributed  memory 

Programming  paradigm 

DMA  +  ptbreads 

MPI 

SIMD  ISA 

SPE  2-way  double  precision,  4-way  sin¬ 
gle  precision 

SSE  2-way  and  Power  2-way  double  pre¬ 
cision 

Optimizations 

Latency,  throughput 

Latency 

Data  layouts 

Block  distributed,  block  cyclic 

Block  distributed,  block  cyclic 

Table  1.1:  An  overview  of  the  DFT  library  features  supported  by  our  system  for  the  two  target  platforms. 


From  a  standard  FFT,  and  the  platform  parameters  supplied,  our  system  formally  derives 
FFT  variants  that  are  adapted  to  the  target  architecture.  In  many  cases,  this  eliminates  the  need 
to  search  over  a  large  space  of  variants.  For  platform  parameters  that  are  difficult  to  model,  we 
define  and  search  over  a  small  space  of  potential  algorithms. 

The  output  of  our  system  is  high-performance  platform-tuned  code  for  the  desired  input 
transform(s)  that  includes  platform-specific  instructions  (e.g.,  DMA  code  for  the  Cell  or  MPI 
code  for  supercomputing  platforms).  Our  system  is  also  designed  to  work  with  other  architec¬ 
tural  paradigms  and  optimizations  that  previous  work  with  Spiral  accomplished,  including  the 
SIMD  vector  paradigm  and  optimizations  for  the  memory  hierarchy.  We  evaluate  our  system  on 
several  distributed  memory  platforms,  including  one  chip-based  platform  (the  Cell  BE)  and  a 
few  supercomputing  platforms.  In  this  thesis,  we  focus  on  two-power  input  sizes  for  ID  DFTs, 
and  also  generate  code  for  2D  DFTs. 

1.4  Thesis  Contributions 

The  following  are  the  main  contributions  of  this  thesis; 

•  A  formal,  rigorous,  mathematical  framework  and  its  implementation  that  rewrites  a  DFT 
problem  specification  into  highly  optimized  code  for  a  target  distributed  memory  archi¬ 
tecture.  The  implementation  of  this  framework  enables  the  computer  generation  of  fast 
DFT  code  for  a  range  of  existing  platforms,  and  also  potentially  enables  the  generation  of 
fast  code  for  future  architectures  and  programming  paradigms  based  on  the  distributed 
memory  design. 

•  A  set  of  rewrite  rules  to  formally  parallelize  and  stream  code  at  a  high  level  of  abstraction. 
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•  Working  platform-tuned  libraries  for  the  Cell  and  for  MPI-based  supercomputing  plat¬ 
forms,  all  automatically  produced  by  our  generator. 

•  And  finally,  a  deeper  understanding  of  the  structure  of  FFTs  needed  for  the  parallelism  and 
memory  streaming  features  found  in  distributed  memory  architectures. 


1.5  Related  Work 

We  already  discussed  FFT  libraries  in  Section  1.2.1  and  Section  1.2.2.  Here,  we  discuss  other 
related  work  on  automatic  performance  tuning. 

ATLAS.  [Whaley  et  al.,  2001]  Automatically  Tuned  Linear  Algebra  Software  (ATLAS)  is  an  of¬ 
fline  adaptive  library  that  provides  basic  linear  algebra  subroutine  (BIAS)  functionality.  Its  ap¬ 
proach  is  based  on  the  observation  that  the  tuning  of  matrix  operations  can  be  accomplished  by 
finding  optimal  values  of  certain  parameters  like  loop  block  sizes  and  unrolling  factors.  Bench¬ 
marks  are  executed  on  the  target  platform  to  determine  optimal  values  for  these  parameters  at 
installation  time.  The  corresponding  basic  kernel  is  then  generated  and  used  when  the  library  is 
called  for  BIAS  operations. 

LAPACK  [Anderson  et  al.,  1999]  and  the  distributed  memory  extension  to  it,  ScalAPACK 
[Blackford  et  al.,  1997],  are  linear  algebra  libraries  that  use  an  underlying  BIAS  library  such  as 
ATLAS  to  provide  higher  level  functionality.  The  basic  idea  was  that  ATLAS  would  update  BIAS 
routines  for  new  platforms,  whereas  LAPACK  would  stay  unchanged  over  time.  This  approach 
was  successful  for  many  years,  but  could  not  easily  incorporate  optimizations  for  recent  archi¬ 
tectural  features  such  as  SIMD  vectorization  and  multicore  parallelization. 

Tensor  contraction  engine  (TCE).  The  TCE  compiler  [Baumgartner  et  al.,  2005]  is  an  exam¬ 
ple  of  a  project,  that  like  Spiral,  uses  a  domain-specific  language  to  transform  a  high-level  spec¬ 
ification  of  computation  to  an  optimized  implementation  in  Fortran  code.  It  targets  tensor  con¬ 
tractions  used  in  quantum  chemistry.  It  too  performs  algebraic  transformations  to  minimize  the 
operations  count,  and  then  optimizes  code  based  on  abstractions  of  the  target  architecture.  Like 
LAPACK,  it  relies  on  an  external  library  for  performing  BIAS  operations. 

FLAME.  Formal  Linear  Algebra  Method  Environment  [Gunnels  et  al.,  2001;  Van  Zee  et  al., 
2009]  is  a  project  that  is  similar  in  principle  to  Spiral,  except  it  targets  the  development  of  dense 
linear  algebra  libraries.  It  uses  a  domain-specific  notation  for  expressing  loop-based  linear  al¬ 
gebra  algorithms,  which  makes  systematic  derivations  for  algorithms,  and  their  optimized  im¬ 
plementations  possible.  Recent  work  allows  parallelization  for  multi-core  architectures  using 
OpenMP  [Van  Zee  et  al.,  2008] . 


1.6.  Thesis  Organization 
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1.6  Thesis  Organization 

The  rest  of  this  thesis  is  organized  as  follows.  Chapter  2  provides  background  on  FFTs,  paral¬ 
lelization,  and  the  prior  work  on  automatic  program  generation  using  Spiral.  Chapter  3  dis¬ 
cusses  the  formalism  for  the  parallelization  and  streaming  of  tensor  products.  Parallelization  and 
streaming  are  the  two  fundamental  architectural  paradigms  considered  in  this  thesis.  Chapter  4 
applies  this  formalism  to  the  derivation  of  FFTs  tuned  for  our  target  platforms.  Optimizations 
for  both  latency  and  throughput  are  presented.  Chapter  5  presents  our  program  generation  sys¬ 
tem,  including  ways  to  produce  code  for  various  programming  paradigms  and  platform  specific 
instructions.  Chapter  6  presents  experimental  results  that  evaluate  the  performance  of  our  gen¬ 
erated  DFT  code  and  benchmarks  it  against  existing  libraries.  Chapter  7  concludes  this  thesis. 


Chapter  2 


Background 


In  this  chapter,  we  present  background  on  the  discrete  Fourier  transform  and  its  fast  algorithms. 
We  first  describe  the  matrix  representation  of  transforms,  and  then  SPL,  a  declarative  represen¬ 
tation  for  transform  algorithms.  Then,  we  present  an  overview  of  existing  fast  algorithms  that 
compute  the  DFT,  including  algorithms  for  parallel  and  vector  machines.  Finally,  we  present 
Spiral,  a  high-performance  automatic  program  and  library  generator  for  linear  transforms.  The 
rest  of  this  thesis  is  built  on  SPL  and  the  Spiral  system  presented  in  this  chapter. 


2. 1  Transforms  and  SPL 

Any  linear  transform  can  be  defined  as  a  matrix- vector  product 

y  =  Mx, 

where  x  and  y  are  the  input  and  output  vectors  respectively,  and  M  is  the  fixed  transform  matrix. 
A  generic  matrix-vector  multiplication  requires  0(n^)  arithmetic  operations.  However,  for  most 
transforms  used  in  signal  processing  and  other  areas,  the  structure  of  the  matrix  can  be  exploited 
to  reduce  the  complexity  to  0(n log  n)  arithmetic  operations.  This  is  done  by  factorizing  the 
transform  matrix  into  a  set  of  structured  sparse  matrices,  and  performing  a  sequence  of  matrix- 
vector  multiplications  on  the  input  vector  with  each  of  these  sparse  matrices. 

In  this  thesis,  we  use  an  abstraction  called  SPL  (Signal  Processing  Language)  [Xiong  et  al., 
2001]  to  represent  such  factorizations.  SPL  is  a  domain  specific  language  derived  from  matrix 
algebra  and  developed  to  describe  transforms  and  their  algorithms.  SPL  is  an  extension  of  the 


16 


Chapter  2.  Background 


Kronecker  product  (or  tensor  product)  formalism  described  in  [Van  Loan,  1992],  and  [Johnson 
etal.,  1990]. 

SPL  as  used  in  this  thesis  consists  of: 


•  predefined  matrices  (e.g.,  F2), 

•  structured  matrices  (e.g.,  identity,  permutation  matrices),  and 

•  matrix  operators  (e.g.,  product,  tensor  product). 


These  are  introduced  in  detail  next. 

Predefined  matrices  include  F2  which  is  also  the  DFT  of  size  2  (DFT2): 


F2  -  DFT2 


1  1 
1  -1 


(2.1) 


and  the  diagonal  matrix,  which  is  filled  with  zeros  except  for  its  diagonal  elements.  For  example, 
diagia,  h,  c,  rf)  is  a  4  x  4  diagonal  matrix  with  the  diagonal  entries  a,  b,  c,  and  d: 


diag(fl,  b,  c,  d)  - 


d 


Before  we  continue,  it  is  important  to  also  view  matrices  from  the  dataflow  perspective.  Visual¬ 
ization  is  a  powerful  aid  when  understanding  and  developing  algorithms,  and  dataflow  graphs 
can  help  us  intuitively  visualize  SPL  expressions,  algorithms,  and  program  loops.  As  an  example, 
we  present  the  dataflow  graph  for  the  transform  F2  as  defined  in  (2.1).  The  matrix- vector  mul¬ 
tiplication  represented  by  (2.1)  is  presented  below  along  with  its  corresponding  dataflow  graph. 
Note  that  the  direction  of  flow  in  these  dataflow  graphs  is  from  right  to  left,  which  corresponds 
with  the  flow  of  the  input  vector  through  the  matrix  to  the  output  vector  when  written  in  matrix 
notation: 


To 

1  1 

Xo 

71. 

1  -1 

.•^1. 

(2.2) 
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The  dataflow  graph  above  expresses  that  yo  is  computed  using  xq  and  xi  (their  sum,  in  this 
case),  and  that  similarly,  yi  uses  Xj  and  Xq.  The  actual  addition  and  subtraction  are  abstracted 
away.  The  shape  of  the  graph  explains  why  F2  is  often  called  the  butterfly  matrix.  We  will  use 
similar  dataflow  graphs  later. 

The  dataflow  graph  above  will  be  the  general  form  of  dataflow  graphs  we  will  use  in  the  re¬ 
mainder  of  this  thesis. 

Structured  matrices  include  the  identity  and  permutation  matrices.  The  identity  matrix  I„ 
is  the  square  nx  n  matrix 


Permutation  matrices  have  exactly  one  single  1  in  each  row  or  column,  and  hence  permute 
the  input  vector.  We  only  consider  permutation  matrices  of  the  form  i™”,  which  transpose  an 
input  vector  of  size  mn,  when  viewed  as  a  m  x  fc  matrix  stored  in  row-major  order.  This  class  of 
permutations  is  called  “stride  permutations”  since  the  output  of  the  permutation  is  equivalent 
to  reading  (or  writing)  the  input  (or  output)  at  a  stride.  For  example. 


and  y  - 1-2  has  the  dataflow 
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y  ^ 


yo«- 


•xo 

Xl 

X2 

X3 

Xi 

X5 

Xe 


y?*- 


-«X7 


The  two  matrix  operators  we  will  consider  in  this  thesis  are  the  product  and  the  tensor  prod¬ 
uct.  The  product  is  simply  defined  as  A- B  =  AB.  This  is  equivalent  to  multiplying  the  input  first 
with  B,  and  the  resulting  vector  hy  A: 


{A-B)x  =  A-{Bx). 

The  tensor  (or  Kronecker)  product,  which  is  fundamental  to  the  representation  of  fast  Fourier 
transforms  and  to  the  approach  used  in  this  thesis  is  defined  as; 


A®B  —  [(i^(B\,  A= 


We  present  a  few  special  cases  of  tensor  products  that  arise  when  one  of  the  two  matrices  is  the 
identity  matrix.  These  are  fundamental  to  this  thesis  for  two  important  reasons:  a)  as  we  will 
see  later,  fast  algorithms  for  the  DFT  consist  mainly  of  such  tensor  products,  and  h)  these  tensor 
products  expose  the  loop  structure  of  such  algorithms  in  a  manner  that  allows  us  to  map  them  to 
architectural  features  of  the  target  hardware  (e.g.,  distributed  memory  parallelism).  We  consider 
four  cases  helow  which  differ  only  in  the  stride  at  which  their  inputs  and  outputs  are  accessed, 
as  shown  in  Table  2.1.  We  present  the  four  cases  using  both  the  matrix  representations  and  as 
corresponding  dataflow  graphs: 

•  /„  i8>  Am  is  a  block-diagonal  matrix,  where  the  transform  Am  is  applied  n  times  to  consec- 
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SPL  Construct 

Output  stride 

Input  stride 

y  —  ® 

1 

1 

y  —  (^m  ® 

n 

n 

y  =  {Ifi  ®  ^ 

1 

n 

y  ~  ®  Afn)^ 

n 

1 

Table  2.1:  Basic  SPL  constructs.  Strides  refer  to  inputs  and  outputs  of  each  of  the  Am  matrices. 


utive  blocks  (of  size  m)  of  the  input  vector: 


A 


m 


In®A  = 


A 


m 


A 


m 


For  example, 


1  1 


/2  (g)  F2  — 


1  -1 

1  1 


1  -1 


and  the  corresponding  dataflow  graph  of  3/  =  (h  ^  F2)  is 


•  Am  ®  In  also  involves  applying  Am  n  times  to  the  input  vector,  but  it  is  applied  to  elements 
read  and  written  at  a  stride  of  n  (as  also  indicated  in  Table  2.1): 
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A®In  = 


For  example, 


■■■ 

■■■ 

1  1 

1  1 

F2  = 

■  1  -1 

1  -1 


,A  =  [ajcf], 


and  the  corresponding  dataflow  graph  of  y  =  {F2 1S172)  is 


y  - - 


•  il„  18)  Am)L"’^  is  a  product  of  the  first  case  and  a  permutation.  This  transform  can  he  ex¬ 
ecuted  in  two  steps,  with  the  permutation  being  applied  first  to  the  input  vector  followed 
hy  the  !„  Am  operation.  However,  to  yield  better  performance,  the  whole  transform  can 
be  applied  in  a  single  step,  with  the  input  vector  being  read  at  a  stride  of  n,  and  the  out¬ 
put  vector  being  written  at  the  regular  unit  stride.  This  makes  it  a  combination  of  the  two 
previous  cases,  as  Table  2.1  shows.  The  example  below  illustrates  both  these  methods: 


(/2®F2)L^ 


1  1 

1 

1  1 

1  -1 

1 

1  -1 

1  1 

1 

1  1 

1  -1 

1 

1  -1 

The  two  dataflow  graphs  corresponding  to  the  first  and  the  last  matrices  above  are: 
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y  ^ 


X 


y  ^ - X 


•  [In  <8>  Am)  is  a  product  of  the  second  case  and  a  permutation.  It  is  a  transposed  version 
of  the  previous  case.  The  input  is  read  at  a  unit  stride  while  the  output  is  written  at  a  stride 
of  n.  Below,  we  show  an  example  of  performing  this  transform  both  using  two  steps,  and 
using  a  single  step: 


4ii2®y2) 


1 

1  1 

1  1 

1 

1  -1 

1  1 

1 

1  1 

1  -1 

1 

1  -1 

1  -1 

The  two  dataflow  graphs  corresponding  to  the  first  and  the  last  matrices  above  are: 


y -4 - X  y -4 - X 


SPL  expressions.  SPL  expressions  or  formulas  used  in  this  thesis  follow  the  grammar  shown 
in  Table  2.2. 
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(formula) 

::=  (matrix)  |  (transform)  | 

(formula)  (formula)  | 

product 

(formula)  ®  (formula) 

tensor  product 

(matrix) 

["cd]\^n\q\- 

(transform) 

::=  DFT„  |  ... 

Table  2.2:  SPL  grammar  [Piischel  et  al.,  2005]  in  Backus-Naur  form  [Harrison,  1978].  An  SPL  formula  is 
either  a  matrix,  transform,  or  is  constructed  using  other  SPL  formulas,  a,  b,  c,  d,  n  and  k  are  integers. 


=  C~'^b'^ 

(2.3) 

[A»B)'^  =  a'^  ^^b'^ 

(2.4) 

Imn  —  Im  ®  In 

(2.5) 

A»  B  =  [A»  B) 

(2.6) 

A0(BC)  =  (A0B)(A®C) 

(2.7) 

A0B  =  L^“(B0A)I^” 

(2.8) 

( j  mn^-l  _  j  mn 

(2.9) 

jkmn  irkn^j  -./t  „  rmn. 

(2.10) 

jkmn  „Tmn^fjkn^j  , 

Bkm  )iLf.  0/m) 

(2.11) 

Table  2.3:  SPL  formula  identities  to  manipulate  SPL  constructs.  A  is  n  x  n,  and  B  and  C  are  mx  m.  is 
the  transpose  of  A. 


2.1.1  SPL  Identities 

SPL  expressions  can  be  manipulated  using  mathematical  identities.  We  present  these  identities 
in  Table  2.3.  We  will  use  these  and  develop  others  in  this  thesis  to  manipulate  fast  algorithms 
expressed  in  SPL  to  better  “fit”  target  architectures. 

This  section  has  introduced  the  matrix  representation  of  transforms,  and  SPL.  Next,  we  show 
how  SPL  is  used  to  represent  fast  algorithms  for  the  DPT. 


2.2  Discrete  and  Fast  Fourier  Transforms 

In  this  section,  we  first  define  the  DPT  and  its  most  important  fast  algorithm:  the  general-radix 
Cooley- Tukey  fast  Pourier  transform  (PPT).  Then  we  discuss  variants  of  the  DPT,  and  put  termi¬ 
nology  used  in  the  literature  in  the  context  of  this  thesis. 

DPT  definition.  The  DPT  of  n  input  samples  Xq,  . . . ,  Xn-i  is  defined  in  summation  form  as 

yk=  Y.  0<k<n, 

0<:f<n 
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Figure  2.1:  Cooley- Tukey  FFT:  matrix  structure,  dataflow  visualization  from  [Franchetti  et  al.,  2009b]. 
Each  point  on  the  dataflow  visualization  is  a  complex  element.  The  dataflow  visualization  contains  4 
stages,  which  correspond  to  the  stages  shown  in  the  matrix  structure  visualization. 


with  (i)„-e  2^7/'!  (a  primitive  u*  root  of  unity).  In  the  matrix  form,  the  DFT  is  written  as: 

j/^DFT„x,  DFT„  ^ 

Cooley-Tukey  FFT.  The  first  “fast”  algorithm  for  the  DFT  was  discovered  by  Cooley  and  Tukey  [Coo¬ 
ley  and  Tukey,  1965]  \  and  is  often  referred  to  as  “the”  FFT  It  is  expressed  in  SPL  as: 

DFTm„  -  (DFTOT«>/n)D„,m(/mi8)DFT„)L”",  (2.12) 

where  D„  ^  is  the  diagonal  matrix  containing  the  twiddle  factors,  defined  in  [Van  Loan,  1992] . 

To  aid  an  intuitive  understanding,  we  show  in  Figure  2. 1  the  matrix  and  dataflow  representations 
of  a  DFT  of  size  16  (16  input  samples)  performed  using  the  FFT  in  (2.12)  with  m  =  n  =  4:. 

Note  that  the  FFT  algorithm  in  (2.12)  is  recursive:  it  factorizes  the  DFT  into  a  product  of 
factors  that  contain  smaller  DFTs.  For  n  a  power  of  two,  the  base  case  is  DFT2  =  F2,  defined 
in  (2.2). 

^The  algorithm  was  already  known  around  1805  to  Carl  Friedrich  Gauss,  hut  his  work  was  not  widely  recog¬ 
nized  [Heidemanet  al.,  1985]. 

^It  is  important  to  note  the  distinction  between  the  terms  “DFT”  and  “FFT.”  The  former  refers  to  the  problem,  whUe 
the  latter  refers  to  an  algorithm.  Literature  often  fails  to  make  this  distinction,  which  can  be  a  source  of  confusion. 
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Several  other  FFTs  exist.  These  include  the  Bluestein  or  Winograd  FFT  (for  any  input  size  n 
including  non-2  power  sizes),  the  Rader  FFT  (for  prime  n),  and  the  prime-factor  FFT  (used  when 
n  is  a  product  of  coprime  factors)  [Van  Loan,  1992] .  These  algorithms  are  typically  used  for  small 
sizes  (<  32)  as  the  hase-case  for  the  FFT  recursion,  and  are  not  relevant  to  this  thesis.  In  this 
thesis,  we  focus  on  two-powers  n  =  2^  and  variants  of  the  Cooley- Tukey  FFT  that  work  especially 
well  with  our  target  architectures. 

DFTs  of  higher  dimensions,  especially  2  and  3-dimensions,  are  also  used  widely,  and  are 
naturally  expressed  within  the  Kronecker  product  formalism  simply  as  the  tensor  product  of  two 
or  more  ID  DFTs: 


DFTmxn  =  DFTm<8'DFT„,  (2.13) 

DFTfcxmxn  =  DFTfc(8iDFTmi8)DFT„.  (2.14) 

The  row-column  algorithm  for  breaking  down  multidimensional  DFTs  factorizes  the  tensor 
product  of  DFTs  into  tensor  products  of  DFTs  with  identity  matrices.  For  example, 

DFTfnx  n  —  DFTnj  i8>  DFTjj  —  (DFTm  ®In)  (fm  ®  DFT^).  (2.15) 

Variants.  In  practice,  applications  may  require  variants  of  the  DFT.  Most  of  these  are  similar 
to  the  standard  DFT,  and  hence,  fast  code  for  the  standard  DFT  can  usually  he  adapted  with 
minor  modifications.  We  present  the  most  important  variants  here: 

•  Forward/inverse.  A  forward  DFT  goes  converts  the  input  in  time  domain  to  frequency 
domain,  and  vice  versa.  In  terms  of  computation,  the  inverse  DFT  (with  output  scaled  by 
UN  where  N  is  the  problem  size)  is  the  same  as  the  forward  DFT  with  the  exception  of  the 
twiddle  factors  used.  Performance  is  typically  the  same  or  very  close  for  the  forward  and 
inverse  DFTs  as  the  number  of  operations  and  the  access  patterns  remain  the  same.  Thus, 
program  generation  systems  or  DFT  libraries  can  simply  switch  twiddle  values  to  reverse 
direction.  The  performance  numbers  presented  in  this  thesis  apply  to  both  forward  and 
inverse  DFTs. 

•  Complex/real-input  data.  The  input  to  the  DFT  is  a  vector  of  complex  numbers.  Special¬ 
ized  algorithms  exist  to  optimize  DFTs  for  real  input  (where  the  imaginary  parts  are  all  0), 
but  are  outside  the  scope  of  this  thesis. 

•  Interleaved/split-complex  format.  Two  major  data  formats  exist  for  representing  com¬ 
plex  arrays.  Arrays  in  the  interleaved-complex  format  are  comprised  of  alternating  real 
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and  imaginary  values,  where  a  complex  number  is  represented  by  two  successive  values 
in  the  array.  Arrays  in  the  split-complex  format  are  comprised  of  all  real  values  grouped 
together,  followed  by  the  imaginary  values.  By  default,  we  assume  input/ output  using 
the  interleaved-complex  format.  Split-complex  format  is  handled  by  placing  appropri¬ 
ate  stride  permutations  at  the  beginning  and  the  end  of  the  transform,  and  pushing  these 
permutations  into  inner  loops  to  mitigate  or  avoid  the  cost  of  performing  them  explicitly. 

•  Bit-reversed  output.  When  the  output  of  a  DFT  is  said  to  be  “bit-reversed”,  it  means 
the  output  array  is  permuted  in  such  a  fashion  that  their  index  bits  are  reversed  (hence 
the  term  “bit-reversal  permutation”).  In  some  applications  like  a  DFT-based  convolution, 
when  performing  a  forward  DFT,  followed  by  a  set  of  operations,  followed  by  an  inverse 
DFT,  the  order  of  the  output  either  does  not  matter,  or  the  intervening  operations  can  be 
modified  at  no  cost  to  handle  bit-reversed  output.  A  permutation  step  may  be  saved  if  bit- 
reversed  output  is  acceptable.  This  may  provide  significant  runtime  savings,  especially  in 
distributed  memory  systems  where  inter-node  permutations  are  expensive.  It  is  important 
to  keep  this  in  mind  when  comparing  performance  results  across  DFT  libraries. 

Relevant  terms  used  in  literature.  The  DFT- related  terminology  in  literature  can  sometimes 
be  ambiguous.  We  list  relevant  terminology  here  as  a  reference  and  to  relate  our  presentation 
with  the  existing  literature. 

•  Decimation  in  time  /  decimation  in  frequency.  The  DFT  is  symmetric:  DFT„^  =  DFT„. 
(2.12)  is  called  the  recursive  decimation-in-time FFT,  while  its  transpose  (using  (2.3), (2.4)) 
is  called  the  recursive  decimation-in-frequency  FFT,  and  is  shown  below: 

(/,7j  iSi  DFT,2)Dn,m(DFTni  <8i/,2), 

•  Radices,  radix- r  FFT.  The  value  of  n  in  (2.12)  is  the  radix  of  the  Cooley- Tukey  decomposi¬ 
tion  for  that  level.  If  the  same  radix  is  chosen  recursively,  the  resulting  algorithm  is  called 
a  radix- r  algorithm  (necessarily,  n  -  in  this  case).  Mixed-radix  algorithms  are  used  for 
sizes  that  are  not  powers  of  a  suitable  radix  or  for  performance  reasons. 

•  Corner  turns,  matrix  transpositions.  A  corner-turn  (named  so  because  it  “turns”  the  ma¬ 
trix  around  a  corner)  or  a  matrix  transposition  is  simply  a  stride  permutation  as  defined  in 
Section  2.1. 

•  Transposed  output.  This  usually  means  bit-reversed  output,  as  explained  above. 
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2.3  FFTs  for  Parallel  Architectures 

So  far,  we  have  introduced  the  Cooley- Tukey  FFT  in  SPL.  In  this  section,  we  present  other  FFTs 
that  were  designed  for  parallel  architectures.  All  these  FFTs  can  be  derived  from  (2.12). 

Parallelism  in  platforms:  a  historical  view.  High-performance  DFT  libraries  have  long  been 
used  on  a  wide  range  of  platforms.  In  the  past,  platforms  could  be  broadly  categorized  into  desk¬ 
top  and  supercomputing  platforms.  Desktop  platforms  in  the  past  had  minimal  or  no  support  for 
parallelism  (1970-1996)  after  which  they  were  restricted  to  short-vector  parallelism  (1996-2004). 
Performance  bottlenecks  for  the  DFT  on  these  platforms  was  primarily  rooted  in  the  memory 
hierarchy.  However,  the  latest  desktop  architectures  are  based  on  cache -coherent  multicore  de¬ 
signs  (e.g.,  Intel  Quad-core,  and  AMD  Opteron,  Phenom).  Since  we  build  upon  some  ideas  used 
on  parallel  DFTs  on  these  machines,  we  present  them  here. 

Supercomputing  platforms  have  included  various  forms  of  parallelism  including  long-vector 
instructions  and  multi-node  parallelism.  Since  these  platforms  most  closely  resemble  the  dis¬ 
tributed  memory  parallel  architectures  this  thesis  considers,  we  look  closely  at  FFTs  that  have 
been  used  on  them  in  the  past. 

Below,  we  present  several  types  of  historical  and  existing  parallel  DFT  algorithms,  including 
their  strengths  and  weaknesses. 

Four-step  FFT.  The  four-step  algorithm  [Hegland,  1994;  Norton  and  Silberger,  1987;  Van  Loan, 
1992]  was  developed  for  supercomputers  which  provided  parallelism  in  the  form  of  long  vector 
instructions.  It  produces  the  longest  possible  unit  stride  vector  operations  at  the  cost  of  a  trans¬ 
position.  It  is  a  simple  manipulation  of  the  general  Cooley- Tukey  FFT  (2.12)  using  (2.8): 

DFTmn  -  (DFTm  iS)7„)D„,TOf'm”(DFT„  ®Im)-  (2.16) 

The  algorithm  is  named  so  because  of  the  four  factors  seen  in  (2.16).  This  algorithm  is  pri¬ 
marily  used  on  vector  architectures,  but  is  inefficient  on  multicore  or  multi-node  architectures. 

Six-step  FFT.  The  six-step  algorithm  was  developed  for  distributed  memory  machines.  It  is 
still  used  on  such  machines  along  with  variants.  The  six-step  is  again  a  simple  modification  of 
the  general  Cooley- Tukey  FFT  (2.12)  using  (2.8)  that  yields  six  factors,  three  of  which  are  global 
transpositions  (which  become  global  all-to-all  data  exchanges): 

D¥Tmn  =  L’^"Un®B¥Tm)L^"Dn,miIm^'D¥Jn)L’Z".  (2.17) 

The  main  issue  with  the  srx-step  algorithm  is  the  high  implementation  cost  of  the  permuta¬ 
tions.  The  algorithm  is  used,  for  example,  in  the  FFTE  library  [Takahashi,  2002] . 
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Iterative  Algorithm:  triple-loop.  There  also  exist  iterative  algorithms  for  the  DFT  which 
can  he  derived  hy  recursively  expanding  the  Cooley- Tukey  FFT  and  applying  identities  from  Ta¬ 
ble  2.3.  The  simplest  iterative  FFT  is  the  decimation  in  time  triple-loop  [Cooley  and  Tukey,  1965; 
Van  Loan,  1992]; 


(2.18) 


Ic  Ic 

is  the  radix- r  digit  reversal  permutation  and  Dl  contains  the  twiddle  factors  in  the  ith 


stage.  A  radix-2  version  is  implemented  by  Numerical  Recipes  [Press  et  al.,  1992] . 

Iterative  algorithm:  Pease.  A  variant  of  (2.18)  is  the  Pease  FFT  [Pease,  1968;  Van  Loan,  1992], 
which  has  constant  geometry,  i.e.,  the  control  flow  is  independent  of  the  stage.  It  requires  an 
explicit  digit  reversal  permutation  at  the  end,  which  may  be  expensive  to  implement.  It  was 
originally  developed  for  parallel  computers,  and  its  regular  structure  makes  it  a  good  choice  for 
field-programmable  gate  arrays  (FPGAs)  or  ASICs: 

Iterative  algorithm:  Stockham.  The  StockhamWT  [Schwarztrauber,  1987;  Van  Loan,  1992] 
is  a  self-sorting  algorithm  which  means  that  it  does  not  require  a  digit  reversal  permutation  at 
the  end.  It  was  also  originally  developed  for  vector  computers: 


Distributed  memory  FFT.  Bonelli  et  al.  [2006]  derive  an  FFT  for  a  distributed  memory  plat¬ 
form  with  p  processors.  First  the  stride  permutation  in  (2.12)  is  factored  using  the  following 
identity: 


T  mn 


-  ®  Imnlp^)ilp  ®  ip  ®  Imlp). 


Then,  (2.12)  is  manipulated  to  yield: 


DFTmn  -  L';^"{In®'D^Tm)LTDn.mUm^'D¥Tn)LT 

Up  ^L^p^  Imlp)  Up  ^Inlp^  DFT„)  Up  ®  L™;/)  uf  ®  Im„lp2) 

Up'^Lp  ^  Ifi!  p)Dm,nUp®  Im!  p  n)Up®  L-m!  '^^mnlp'^) 

Up  ®  Lp  ^  Imlp) 


The  resulting  algorithm  has  four  “macro -stages”  (shown  one  per  line),  mutually  separated  by 
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three  all  to  all  data  exchanges  {Lp  The  algorithm  was  intended  primarily  for  MPI-hased 

clusters,  and  is  unahle  to  take  advantage  of  architectures  that  can  reduce  data  transfer  costs  hy 
performing  communications  in  the  background.  It  also  does  not  allow  for  SIMD  vectorization  in 
its  original  form. 

Multicore  FFT.  Franchetti  et  al.  [2006a]  derive  an  FFT  for  a  shared  memory  multicore  plat¬ 
form  with  p  processors  and  cache  block  size  p: 

DFTm«  —  {[Ljy/’  ®  Inipfj)  ®  (fp  ®  (DFT,j2  ®Inlp))  ((ip  ^  ®  Inipp) 

X  [lp0[I,„ip^DFTn)LZilf)[i4''  ®  Imlpp)^  Ip).  (2.19) 

Inspection  of  (2.19)  shows  that  all  permutations  have  the  form  {A®  Ip  which  always  ensures 
that  cache  lines  are  accessed  in  blocks  of  size  p,  and  avoids  false  sharing.  Further,  all  computa¬ 
tion  steps  have  the  form  Ip  i8>  A,  which  ensures  that  these  are  load  balanced. 

In  this  section,  we  have  seen  FFTs  for  parallel  architectures  at  a  mathematical  level  of  ab¬ 
straction.  We  now  present  Spiral,  a  system  that  can  generate  high  performance  code  based  on 
such  abstractions. 


2.4  Spiral 

Overview.  We  presented  several  fast  algorithms  for  the  DFT  based  on  the  FFT.  We  now  discuss 
Spiral,  a  system  that  was  developed  to  reap  the  practical  benefits  of  algorithm  exploration.  This 
thesis  builds  upon  and  extends  Spiral  to  generate  fast  code. 

Spiral  [Piischel  et  al.,  2005]  is  a  program  generator  that  autonomously  generates  a  high- 
performance  software  library  for  some  selected  functionality  and  target  platform.  Spiral  origi¬ 
nally  focused  on  linear  transforms,  and  in  particular  the  DFT.  However,  latest  developments  ex¬ 
pand  Spiral  beyond  this  domain  to  include  libraries  for  coding,  linear  algebra  (matrix- matrix- 
multiplication),  and  image  processing  [de  Mesmay  et  al.,  2010;  Franchetti  et  al.,  2008].  While 
Spiral  still  only  supports  restricted  domains,  the  generated  code  is  very  fast  and  often  rivals 
the  performance  of  expertly  hand-tuned  implementations.  We  limit  the  further  discussion  on 
Spiral  to  the  DFT. 

At  the  heart  of  Spiral  is  the  declarative  representation  of  algorithms — SPL  (discussed  ear¬ 
lier),  symbolic  computation  (rewriting  systems),  and  high-level  architecture  models.  Spiral  rep¬ 
resents  algorithms  and  hardware  models  in  SPL  An  intelligent  search  mechanism  accounts  for 
hardware  details  that  are  difficult  to  model  (such  as  cache-based  systems).  Algorithms  and  pro¬ 
gram  transformations  are  encoded  as  rewriting  rules  that  operate  on  SPL  formulas  to  produce 
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Figure  2.2:  Spiral’s  program  generation  system. 


code. 

2.4.1  System  Structure  and  Sequential  Code  Generation 

Spiral’s  structure  is  shovm  in  Figure  2.2,  which  we  discuss  in  detail  helow. 

Input.  The  input  to  Spiral  is  a  prohlem  specification  containing  the  requested  functional¬ 
ity  and  some  target  hardware  properties,  for  instance  “generate  a  8,192-point  DFT  using  4  Cell 
SPEs,”  appropriately  encoded  in  Spiral’s  internal  language. 

Algorithm  Level.  Breakdown  rules  in  Spiral  are  used  to  represent  algorithms  for  the  DFT 
that  break  down  larger  transforms  into  smaller  kernels.  These  smaller  kernels  are  recursively 
broken  down  into  even  smaller  kernels  using  breakdown  rules  at  each  level,  until  the  recursion’s 
base  cases  are  reached.  The  final  algorithm  typically  involves  various  DFT  algorithms  applied 
at  each  level  of  the  recursion.  This  also  means  that  a  large  space  of  algorithms  (formulas)  for  a 
single  transform  may  be  obtained  using  these  breakdown  rules.  Next,  a  rewriting  system  uses  a 
set  of  rewrite  rules  to  structurally  optimize  the  formula  thus  obtained  to  match  the  target  archi¬ 
tecture. 

Implementation  Level.  Algorithms  in  SPL  can  be  directly  translated  into  sequential  C  code. 
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discussed  in  more  detail  in  Section  2.4.3.  At  the  implementation  level,  an  SPL  compiler  trans¬ 
lates  the  formula  output  into  optimized  C  code,  possibly  including  intrinsics  (for  vectorized 
code)  [Franchetti  et  al.,  2006h]  and  threading  instructions  (for  multicore  parallelism)  [Franchetti 
etal.,  2006a]. 

Evaluation  Level.  The  C  code  thus  obtained  is  compiled  by  a  traditional  backend  compiler 
(such  as  gc  c) .  The  performance  of  this  implementation  is  measured  and  used  in  a  feedback  loop 
to  search  over  algorithmic  alternatives  for  the  fastest  one.  Searching  over  an  algorithm  space  is 
primarily  useful  for  architectural  components  such  as  the  memory  hierarchy  which  are  difficult 
to  model. 

Output.  The  final  output  is  a  high  performance  platform  adapted  implementation  of  the 
transform  requested  by  the  user. 

2.4.2  Parallel  and  Vector  Code  Generation 

A  key  observation  is  that  SPL’s  tensor  product  representation  of  the  transform  algorithms  can 
be  mapped  to  components  of  the  target  architecture.  This  is  because  the  tensor  product  can  be 
viewed  as  a  program  loop,  where  the  structure  of  the  loop  is  made  explicit.  Using  well  known 
formula  identities  recast  as  rewriting  rules.  Spiral  manipulates  algorithms  in  SPL  to  match  the 
target  architecture  [Franchetti  et  al.,  2006a,b],  thus  deriving  high-performance  implementations 
of  the  loop.  We  now  discuss  code  generation  for  parallel  and  vector  architectures  in  further  detail. 

Architectural  paradigms.  Target  architectural  features  like  multicore  parallelism  and  SIMD 
vector  units  that  require  optimization  of  the  algorithm  structure  are  called  architectural  paradigms 
and  are  first  identified.  The  main  idea  then  is  to  find  SPL  expressions  that  map  naturally  well 
onto  each  of  the  paradigms  of  the  target  architecture.  Thus,  these  formula  constructs  represent 
a  high-level  model  of  the  target  architecture.  Remaining  SPL  expressions  contained  in  the  DFT 
are  converted  into  these  natural  constructs  using  rewrite  rules  that  must  be  identified.  The  con¬ 
version  typically  comes  at  a  cost,  which  might  be  vector  permutations,  synchronization  barriers, 
or  additional  index  computations.  Below,  we  look  at  two  examples  of  SPL  expressions  that  map 
well  to  architectural  paradigms:  SIMD  vectorization  and  multicore  parallelization. 

SIMD  vectorization.  Single  Instruction  Multiple  Data  architectures  can  perform  the  same 
instruction  on  multiple  data  input  in  the  same  cycle.  The  main  constraint  in  such  architectures 
is  that  SIMD  instructions  are  setup  to  work  only  on  data  that  is  consecutive  in  memory.  Due 
to  this  constraint,  the  most  natural  fit  for  vectorization  is  the  Am  In  construct  which  was  pre¬ 
sented  in  Section  2.1,  assuming  n>v  and  v  divides  n  (where  v  is  the  SIMD  vector  length).  The 
pseudocode  and  dataflow  graph  below  illustrate  why  this  formula  construct  naturally  fits  the 
paradigm.  The  highlighted  area  in  the  dataflow  graph  represents  a  single  4-way  vector  Instruc- 
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tion.  The  values  xo,Xi,X2,  X3  are  loaded  in  a  vector  register  using  a  single  load  instruction.  The 
values  X4,X5,X6,X7  are  loaded  in  a  second  register,  and  a  single  SIMD  add  instruction  on  these 
registers  yields  4  output  data  points  (yo,  yi,  y2>  ya): 


vec (yO  , 

yi. 

y2. 

y3)  = 

=  vec (xO  ,  xl  , 

x2  ,  x3)  +  vec(x4,  x5  , 

x6  , 

x7)  ; 

vec (y4  , 

y5 . 

y6 . 

y7)  = 

=  vec (xO ,  xl  , 

x2 ,  x3)  -  vec(x4,  x5  , 

x6  , 

x7)  ; 

Other  SPL  constructs  are  rewritten  to  fit  this  construct.  For  example,  I„  iSi  Am  can  he  rewritten 
using  (2.8).  The  cost  of  rewriting  are  the  additional  permutations  which  are  implemented  as 
a  combination  of  vector-block-level  re-indexing,  and  in-vector  permutations  [Franchetti  et  al., 
2006b]. 

Multicore  parallelism.  Since  this  involves  simultaneously  performing  computations  on  more 
than  one  processor  core,  the  dataflow  graph  must  be  partitioned  to  allow  for  this.  Any  loop 
where  iterations  are  completely  independent  of  each  other  is  a  data-parallel  loop  that  fits  this 
paradigm.  The  most  natural  SPL  construct  that  does  this  is  the  I„  <s>  Am  construct  which  was  pre¬ 
sented  in  Section  2.1.  Each  core  of  p  cores  works  on  n/p  iterations  independent  of  other  cores. 
The  dataflow  graph  below  shows  an  example  of  U  is>  F2  parallelized  across  4  processors,  as  high¬ 
lighted  by  the  boxes.  The  graph  implements  the  following  pseudocode,  with  each  fine  executing 
on  a  different  processor: 


yO  =  xO 

-F 

xl  ; 

II 

X 

0 

-  xl  ; 

// 

Execut es 

on 

p  0 

y2  =  x2 

-F 

x3  ; 

y3  =  x2 

-  x3  ; 

// 

Executes 

on 

pi 

y4  =  x4 

-F 

x5  ; 

y5  =  x4 

-  x5  ; 

// 

Execut es 

on 

p2 

y6  =  x6 

-F 

x7  ; 

y7  =  x6 

-  x7; 

// 

Execut es 

on 

p3 
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Other  SPL  constructs  are  rewritten  to  fit  this  construct.  For  example,  Am  81  /„  can  he  rewritten 
using  (2.8).  The  cost  is  the  set  of  additional  permutations,  which  are  implemented  as  re-indexing 
inside  the  parallel  loop. 

In  reality,  rewriting  for  parallelization  is  more  involved  because  of  cache  memory  consider¬ 
ations.  False  sharing  would  occur  if  data  within  the  same  cache  line  was  processed  hy  different 
cores.  Thus,  the  high-level  rewriting  must  account  for  this,  as  presented  in  detail  in  [Franchetti 
et  al.,  2006a] .  This  last  observation  illustrates  an  important  point:  rewriting  for  one  architectural 
paradigm  may  produce  a  conflict  for  another  paradigm  on  the  same  target  architecture.  We  will 
see  this  phenomenon  reoccur  at  various  points  in  this  thesis. 

Rewriting  for  Target  Architectures.  A  platform  may  have  multiple  architectural  paradigms 
that  our  algorithm  needs  to  take  advantage  of.  For  example,  for  a  multicore  Intel  CPU  like  the 
Xeon  Quadcore,  the  algorithm  must  be  both  parallelized  and  vectorized.  Adapting  the  SPL  for¬ 
mula  to  the  target  architecture  is  accomplished  using  a  tagging  system  within  Spiral.  The  input 
to  Spiral  discussed  in  Section  2.4.1  can  be  tagged  with  architectural  parameters  such  as  par¬ 
allelization  and  vectorization  (tags  can  be  nested  when  required).  Tags  are  removed  as  Spiral 
successively  uses  rewrite  rules  (and  breakdown  rules,  as  discussed  in  Section  2.4.1)  to  modify 
tagged  SPL  constructs  into  combinations  of  constructs  that  naturally  fit  the  architecture,  and 
additional  constructs  which  are  the  cost  of  such  conversions.  At  this  point,  the  formula  is  ready 
to  be  converted  into  code.  Below,  we  show  an  example  of  a  rewrite  rule  that  rewrites  a  tagged 
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SPL  construct  into  its  untagged  form: 

In  ^  Afn  '  /p  iSi||  Inip  Am  (2.20) 

par(p,^) 

A  construct  of  the  form  Ip^\\A  specifies  that  A  is  executed  in  parallel  across  multiple  processors. 
The  actual  implementation  is  not  specified  at  this  level,  and  is  accomplished  using  a  program¬ 
ming  paradigm  that  is  best  suited  to  the  target  architecture  (e.g.,  pthreads  or  OpenMP). 


2.4.3  Looped  Code  Generation  and  Z-SPL 

So  far,  we  have  seen  how  Spiral  generates  simple,  sequential  code  for  uniprocessors  and  for 
vector  and  parallel  processors.  However,  these  techniques  do  not  scale  to  larger  transform  sizes, 
for  which  looped  code  must  he  generated.  The  main  idea  in  generating  high  performance  FFT 
looped  code  is  to  fuse  permutation  constructs  into  adjoining  kernel  constructs  so  as  to  avoid 
explicit  copying  steps  in  the  final  code,  which  are  expensive.  However,  this  is  difficult  to  accom¬ 
plish  using  SPL.  Specifically,  this  is  because  SPL  does  not  distinguish  between  kernel  constructs 
that  perform  computations,  and  permutation  constructs  that  perform  only  memory  loads  and 
stores. 

Z-SPL,  a  system  for  formal  loop  merging,  was  developed  to  address  this  issue.  Z-SPL  [Franchetti 
et  al.,  2005]  is  an  extension  of  SPL  that  makes  loops,  and  memory  accesses  as  function  of  loop 
variables  explicit,  while  retaining  the  matrix  abstraction.  It  can  be  seen  as  an  abstraction  layer 
between  SPL  and  actual  C  code.  Z-SPL  adds  the  iterative  sum  of  matrices  and  matrices  param¬ 
eterized  by  index  mapping  functions  (gather  and  scatter  matrices)  to  SPL.  Thus,  Z-SPL  is  made 
up  of  constructs  that  explicitly  represent  compute  kernels  (called  skeletons),  and  constructs  that 
represent  their  associated  loads  and  stores  in  the  form  of  gather  and  scatter  matrices  (called  dec¬ 
orations).  This  allows  it  to  concisely  describe  loop-carried  access  patterns  inside  a  matrix-based 
algorithm  representation. 

In  addition  to  using  it  to  produce  looped  code,  in  this  thesis,  we  also  extend  Z-SPL  to  expose 
and  manipulate  details  about  loads  and  stores  in  order  to  incorporate  explicit  memory  access 
management.  We  first  provide  an  intuitive  introduction  to  Z-SPL,  followed  by  a  formal  defini¬ 
tion.  As  an  illustrating  example,  consider  the  transform  I2  iS>  A2  for  an  arbitrary  2x2  matrix  A, 
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that  operates  on  an  input  vector  of  length  4.  This  construct  can  he  written  as: 


72®  A  = 


1  • 

•  1 

.  .  1  . 

-1- 

A 

...  1 

1  • 

•  1 

1 


y^.  S;  Affi  G;, 
1=0 


(2.21) 


where  the  dots  represent  zero  entries.  In  each  of  the  summands,  the  two  vertically  long  ma¬ 
trices,  called  the  gather  matrices,  select  the  elements  of  the  input  vector  that  Am  works  on,  and 
the  horizontally  long  matrices,  called  the  scatter  matrices,  select  the  part  of  the  output  vector  the 
summand  writes  to,  and  can  thus  he  parameterized  as  shown  in  (2.21). 

More  formally,  Z-SPL  defines  matrices  parameterized  by  index  mapping  functions,  which 
map  integer  intervals  to  integer  intervals.  An  index  mapping  function  /  with  domain  {0, . . . ,  n  - 1} 
and  range  {0, . . . ,  Ai  -  1}  is  written  as  ;  i  ^  f[i).  For  convenience,  we  omit  the  domain  and 
range  where  it  is  clear  from  the  context.  We  now  introduce  the  two  index  mapping  functions 
used  in  this  thesis, 

■  i'-^b+is,  and  (2.22) 

(b+  [i/pj  s)p  -t  [i  mod  p)  (2.23) 


which  abstract  strided  memory  access  at  scalar  and  packet  granularities,  respectively  {b,  s, 
and  p  correspond  to  base,  stride,  and  packet  size).  Index  mapping  functions  are  used  to  pa¬ 
rameterize  gather  and  scatter  matrices,  which  encode  data  movement  (loads  and  stores).  Let 
e”  e  £«xi  be  the  column  basis  vector  with  the  1  in  k-th  position  and  0  elsewhere.  The  gather 
matrix  for  the  index  mapping  is 


■- 


N 

f(n-l) 


T 


Gather  matrices  thus  gather  n  elements  from  an  array  of  N  elements.  Scatter  matrices  are  simply 
transposed  gather  matrices,  S(/”^^)  =  G(/)^. 

The  iterative  matrix  sum,  which  encodes  loops,  is  conventionally  defined.  However,  it  does 
not  incur  operations  since  by  design,  each  of  the  summands  produce  a  unique  part  of  the  output 
vector.  Based  on  our  formal  definitions,  our  previous  example  (2.21)  is  expressed  in  Z-SPL  using 
index  mapping  functions  to  parameterize  gathers  and  scatters  as  ^(^22,1)  Am  G(fi2!,i)- 

Code  Generation.  Since  Z-SPL  formulas  essentially  represent  loops,  converting  them  into 
G  code  is  straightforward.  Gather  matrices  read  from  the  input  vector,  which  the  kernel  then 
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SPL 


Z-SPL 


y=(AnBn)x 

y  —  {In  ®  Affi]x 


y  =  {AnBn]x 


(  n-1 


y  = 


\j=0 


y  —  (^m  ® 


(  n-1 


y  = 


\j=0 


y=iIn^Am)L^^x 


(  n-1 


y  = 


\j=0 


Code 

tCO:l:ii-l]  =  B(x[0:l:n-1]); 

yCO:l:ii-l]  =  A(t  [0 : 1  :n-l]  ; ) 

f or (i=0; i<n;i++) 

y [i*n: 1 : i*m+m-l]  = 

A(x  [i*ii:  1 :  i*m+m-l]  )  ; 

f or (i=0; i<n;i++) 
y  [i :n: i+m-l]  = 

A(x [i :n: i+m-l] ) ; 

f or (i=0; i<n;i++) 

y [i*n: 1 : i*m+m-l]  = 

A(x  [i  :ii:  i+m-l]  )  ; 


Table  2.4:  Translating  SPL  to  Z-SPL,  and  then  to  code,  x  denotes  the  input  and  y  the  output  vector.  The 
subscripts  of  A  and  B  specify  the  size  of  the  (square)  matrix.  We  use  MATLAB-like  notation:  x[b:s:e] 
denotes  the  subvector  of  x  starting  at  b,  ending  at  e  and  extracted  at  stride  s. 


operates  on.  Scatter  matrices  then  write  to  the  output  vector.  The  matrix  sum  becomes  the  actual 
loop.  Table  2.4  shows  how  the  fundamental  Z-SPL  constructs  are  derived  from  SPL  constructs, 
and  then  translated  to  C  code.  By  applying  these  rules  recursively,  any  Z-SPL  formula  can  be 
translated  into  C  code. 


2.4.4  General-size  Library  Code  Generation:  Autolib 

So  far,  we  have  looked  at  automatic  generation  of  code  for  fixed  sizes  only  (the  transform  size  is 
fixed  at  problem  specification  time).  Library  generation  for  general-size  code,  called  “Autolib”, 
is  a  new  addition  to  Spiral  that  enables  the  automatic  generation  of  transforms  where  the  size 
parameter  is  known  only  at  runtime  [Voronenko,  2008;  Voronenko  et  al.,  2009]. 

Autolib  can  generate  libraries  similar  to  FFTW  [Frigo  and  Johnson,  2005] .  The  main  advan¬ 
tage  of  such  a  system  is,  it  obviates  the  need  to  redesign  or  rewrite  a  library  when  major  changes 
need  to  be  made.  For  instance,  when  a  target  architecture’s  set  of  paradigms  changes,  or  when  a 
new  paradigm  needs  to  be  targeted,  or  when  a  new  algorithm  needs  to  be  added,  one  can  simply 
make  high-level  changes  to  the  library  generator,  and  use  it  to  produce  a  new  library. 

In  addition,  an  automatic  library  generation  system  offers  several  other  benefits.  Appro¬ 
priate  libraries  can  be  easily  be  generated  for  different  platforms  that  use  similar  architectural 
paradigms  but  need  different  sets  of  instructions  (e.g.,  Intel’s  SSE  vector  instruction  intrinsics  are 
slightly  different  from  IBM’s  AltiVec  instruction  set,  although  both  are  SIMD  vector  paradigms). 
Autolib  can  also  generate  smaller  libraries  for  platforms  where  code  size  limitations  exist  (trad- 
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ing  off  performance).  It  can  also  generate  a  single  library  for  a  given  set  of  transforms  (e.g.,  the 
discrete  Fourier,  Sine,  and  Cosine  transforms),  while  reusing  code  across  transforms.  We  will 
limit  further  discussion  of  Autolib  to  the  aspects  pertinent  to  this  thesis. 

Design.  Autolib -generated  libraries  are  composed  of  recursive  functions  that  call  themselves 
and  each  other  in  a  recursive  closure.  Such  a  mutually  recursive  closure  derives  from  the  re¬ 
cursive  nature  of  the  transform  algorithms.  A  major  component  of  Autolib  is  to  automatically 
determine  this  recursion  step  closure  from  a  given  set  of  transforms,  algorithms,  and  rules. 

A  significant  difference  between  fixed-size  program  generation  and  general  size  library  gen¬ 
eration  is  that,  instead  of  applying  (and  searching  over)  breakdown  rules  at  program  generation 
time,  the  rules  must  be  “compiled”  into  code  which  can  then  be  applied  (and  searched  over)  at 
runtime.  This  is  because  rules  have  different  applicability  conditions  based  on  the  transform 
size,  which  is  not  known  at  library  generation  time. 

Autolib  and  this  thesis.  It  is  important  to  note  that  the  differences  between  the  fixed  size  pro¬ 
gram  generation  system  and  the  automatic  general-sized  library  generation  system  are  mostly 
orthogonal  to  the  topic  of  this  thesis:  the  techniques  presented  in  this  thesis  are  primarily  of  an 
algorithmic  nature,  and  in  concept,  can  be  interfaced  either  with  the  original  Spiral  system  to 
produce  fixed  size  code,  or  with  Autolib  to  produce  general  sized  libraries. 

The  reason  for  this  is  the  following.  The  fundamental  assumption  in  this  thesis  is  that  FFTs 
can  be  manipulated  using  certain  degrees  of  freedom  and  factorizations  to  yield  high  perfor¬ 
mance  for  a  target  architecture.  In  essence,  this  thesis  establishes  a  set  of  rules  that  are  used  to 
rewrite  a  problem  specification  into  high  performance  platform-tuned  code.  In  their  straight¬ 
forward  form,  these  rules  can  be  used  to  produce  fixed  size  code.  However,  Autolib  can  use  the 
same  set  of  rules  to  automatically  derive  the  recursion  step  closure  that  is  needed  to  build  a 
general-sized  library.  In  practice,  some  rules  may  have  to  be  expressed  in  a  manner  that  is  more 
appropriate  to  the  type  of  program  generation  system  being  used.  We  use  the  fixed-size  program 
generation  infrastructure  for  the  Cell  platform,  and  the  general-size  infrastructure  for  the  cluster 
platform  in  this  thesis. 


2.5  Chapter  summary 

We  presented  SPL,  a  declarative  representation  that  is  at  an  abstraction  level  appropriate  to  sev¬ 
eral  manipulations  that  we  will  perform  later  in  this  thesis.  We  presented  the  FFT  algorithm  and 
several  existing  parallelized  versions  which  are  all  limited  by  their  lack  of  support  for  minimizing 
data  movement  costs.  Finally,  we  presented  Spiral  and  Autolib  which  use  SPL  along  with  rewrit¬ 
ing  systems  to  automatically  generate  fixed  and  general-sized  code  from  problem  specification 
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using  only  a  set  of  rewrite  rules  and  some  high-level  information  about  the  target  architecture. 
Spiral  rewrites  SPL  formulas  to  perform  well  on  certain  architectural  paradigms.  In  the  next 
chapter,  we  will  look  at  the  two  fundamental  architectural  paradigms  required  for  automatic 
program  generation  for  distributed  memory  computing  platforms. 


Chapter  3 


The  Parallelization  and  Streaming  Paradigms 


Optimizing  FFTs  for  distributed  memory  target  architectures  requires  a  thorough  understanding 
of  both  the  algorithm  and  the  architecture.  In  this  chapter,  we  first  present  an  abstraction  of  our 
distributed  memory  target  architectures,  followed  by  a  discussion  of  actual  platforms  used  for 
evaluation  in  this  thesis.  We  then  present  techniques  to  adapt  SPL  formula  fragments  found  in 
the  Cooley- Tukey  FFT  and  other  FFTs  to  two  fundamental  features  that  are  crucial  for  perfor¬ 
mance  on  distributed  memory  architectures:  parallelism  and  streaming.  We  use  the  techniques 
presented  in  this  chapter  as  building  blocks  in  the  next  chapter,  where  we  build  the  actual  FFTs. 


3.1  Target  Architectures 

In  this  section,  we  first  present  abstractions  of  our  target  architectures.  Then,  we  present  details 
of  the  actual  platforms  based  on  our  abstractions  that  were  chosen  to  evaluate  this  thesis.  Specif¬ 
ically,  we  look  at  the  Cell  Broadband  Engine,  and  three  supercomputers:  the  Axon  and  the  Cray 
XT4  and  XT5  systems. 

Our  target  architectures  can  be  broadly  described  as  MIMD  (Multiple  Instruction  stream. 
Multiple  Data  stream)  [Flynn,  1972]  and  NUMA  (Non-Uniform  Memory  Access).  An  abstraction 
is  shown  in  Figure  3.1.  As  this  figure  shows,  our  target  architectures  consist  of  several  identi¬ 
cal  PEs,  each  with  its  own  private  memory.  Below,  we  expand  this  abstraction  by  examining  its 
various  components  in  better  detail. 

Terminology:  nodes  versus  processing  elements.  To  avoid  the  ambiguity  of  the  conven¬ 
tional  term  “node”  in  the  context  of  systems  that  include  multicore  processors  and  incorporate 
multiple  processors  in  each  motherboard,  we  instead  use  the  term  “PE”  (PE)  to  refer  to  the  small- 
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Figure  3.1;  Abstraction  of  the  target  platform  architecture  used  in  this  thesis.  Optional  components  or 
features  are  italicized. 


est  unit  of  the  system  that  can  independently  perform  generic  computations  (such  units  can  in¬ 
ternally  contain  SIMD  units).  A  PE  is  typically  a  single  core  of  a  multicore  processor  inside  the 
system,  or  a  single  unicore  processor  inside  the  system. 

3.1.1  Abstraction  of  Target  Architectures 

Multiple  processing  elements.  Our  model  assumes  a  number  of  identical  PEs  that  operate  in¬ 
dependently,  as  shown  in  Figure  3.1.  Each  PE  consists  of  at  the  least  a  compute  processor,  which 
may  optionally  include  SIMD  vector  units.  Each  PE  may  also  optionally  contain  a  communica¬ 
tions  processor,  which  assists  in  performing  communications  in  the  background  without  inter¬ 
rupting  the  compute  processor. 

Since  we  are  interested  in  a  platform’s  ability  to  perform  numerical  computations,  we  mea¬ 
sure  this  ability  using  the  platforms’  peak  floating  point  performance,  which  is  the  maximum 
number  of  floating  point  operations  that  it  can  retire  per  unit  time,  usually  stated  in  giga  floating 
point  operations  per  second  (Gflop/s).  Floating  point  addition,  subtraction,  and  multiplication 
instructions  are  counted  as  one  operation  each;  fused  multiply-add  instructions  count  as  two 
operations. 

Distributed  memory.  Each  PE  possesses  its  own  local  memory.  We  assume  a  PE  has  faster 
access  to  its  local  memory  than  to  another  PE’s  local  memory.  Every  PE  is  assumed  to  have 
its  own  independent  memory  address  space.  The  last  two  are  not  strict  requirements,  and  it  is 
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straightforward  to  implement  our  ideas  on  systems  that  use  a  shared  memory  address  space,  or 
other  schemes  such  as  PGAS  (partitioned  global  address  space)  [Stitt,  2010]. 

Explicit  data  transfer.  We  also  assume  the  architecture  requires  explicit  programming  calls 
to  transfer  data  across  address  spaces.  Some  architectures  can  overlap  computation  with  com¬ 
munication.  These  architectures  may  use  a  one-sided  or  a  two-sided  communications  model. 
We  define  a  “one-sided  push”  model  as  one  where  the  sender  can  send  data  asynchronously 
using  a  non-hlocking  call,  and  can  later  block  on  completion  of  the  send  (the  receiver  cannot 
natively  block  on  the  receipt  of  expected  data).  We  define  the  “one-sided  pull”  model  analo¬ 
gously.  The  Cell  BE  (discussed  in  Section  3.1.2)  is  based  on  DMA  programming,  and  is  capable 
of  operating  under  both  the  push  and  pull  models.  We  also  include  architectures  with  a  two- 
sided  communications  model,  in  which  the  sender  can  use  a  synchronous  non-blocking  call  to 
send  data,  and  the  receiver  can  use  a  blocking  call  to  complete  receiving  expected  data.  The 
MPI  2.0  specifications  [Forum,  1998b]  include  calls  to  operate  under  both  the  one-sided  and  the 
two-sided  communications  models. 

Interconnection.  The  PEs  are  connected  through  an  interconnect.  We  assume  that,  from  a 
programming  perspective,  every  node  can  communicate  directly  with  every  other  node. 

The  interconnect  network  may  use  one  of  several  topologies,  including  a  ring  (e.g..  Cell  BE), 
hypercube,  fat-tree  (e.g..  Axon),  3D  torus  (e.g.,  BlueCene/P),  mesh  (Tilera  TILE64)  or  others.  The 
actual  hardware  connection  may  consist  of  an  on-chip  network  (Cell  BE)  or  other  transport  tech¬ 
nologies  including  InfiniBand,  Myrinet,  NUMAlink,  and  Ethernet. 

Since  our  EFTs  require  all-to-all  data  exchanges,  both  the  bandwidth  and  the  latency  of  the 
interconnect  network  will  critically  affect  performance.  Interconnect  performance  during  an  all- 
to-all  exchange  is  also  dependent  on  the  topology.  A  good  indicator  of  interconnect  performance 
is  its  bisection  bandwidth  [Hennessy  and  Patterson,  2003]. 

Memory  node.  An  optional  global  memory  node  may  also  exist  in  our  model.  For  distributed 
memory  architectures  that  reside  entirely  on  a  single  chip  or  motherboard  (e.g.,  the  Cell  BE), 
this  is  typically  the  main  memory.  Clusters  may  have  a  dedicated  storage  node.  We  assume  the 
address  space  of  the  memory  node  is  shared  among  the  PEs. 

Emerging  multicore  processors  and  the  distributed  memory  paradigm.  Although  some 
of  the  mainstream  multicore  architectures  (x86  Intel,  AMD  multicores)  use  the  shared  memory 
model,  it  is  important  to  distinguish  between  the  programming  model  and  the  design  and  con¬ 
straints  of  the  underlying  architecture.  Shared  memory  programming  models  allow  the  program¬ 
mer  to  assume  all  PEs  share  a  single  unified  memory  space,  and  are  typically  implemented  using 
cache  coherence  protocols.  As  the  number  of  cores  per  processor  scales,  so  does  the  penalty  to 
be  paid  in  accessing  data  cached  on  a  different  PE.  Consequently,  some  of  the  techniques  pre- 
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Figure  3.2:  Cell  processor  architecture.  The  PPE  is  dimmed  out  in  this  figure  since  it  is  not  used  in  this 
thesis.  Each  SPE  consists  of  an  SPU,  an  MFC,  a  memory  management  unit  (MMU),  and  a  local  store  (LS). 


sented  in  this  thesis  should  also  he  applicable  to  shared  memory  manycore  processors. 

The  ideas  presented  in  this  thesis  apply  to  any  platform  which  can  he  modeled  as  an  instance 
of  our  abstraction.  Actual  architectures  may  implement  this  abstraction  in  different  ways.  We 
consider  several  different  actual  platforms  for  this  thesis,  including  the  Cell  processor  (an  ex¬ 
ample  of  a  single-chip  implementation  of  this  abstraction,  which  is  becoming  an  increasingly 
common  form  of  multicores),  and  several  clusters.  Next,  we  show  how  each  of  our  chosen  archi¬ 
tecture  targets  implements  this  abstraction. 


3.1.2  Cell  Broadband  Engine 

Introduction  and  architecture.  The  Cell  Broadband  Engine  (Cell  BE,  or  simply,  the  Cell)  is  a  mul¬ 
ticore  processor  that  is  designed  for  high-density  floating-point  computation  required  in  multi- 
media,  visualization,  and  other  science  and  engineering  applications.  As  shown  in  Figure  3.2,  its 
design  includes  several  cores  called  synergistic  processing  elements,  or  SPEs,  and  a  main  core, 
the  power  processing  element  (PPE).  The  PPE  is  designed  to  be  the  “master”  core  where  the  op¬ 
erating  system  and  other  main  application  threads  run.  Numerically  intensive  computations  are 
performed  by  the  SPEs. 

Each  SPE  consists  of  a  compute  processor  (called  synergistic  processing  unit  or  SPU)  with 
SIMD  vector  units,  a  memory  flow  controller  (MFC),  and  a  fast  local  memory  (256kB  in  current 
versions),  called  the  local  store  (LS).  An  SPU  can  access  its  own  local  store  much  faster  than  it 
can  access  the  local  stores  of  other  SPUs,  or  the  main  memory.  The  memory  flow  controller  is 
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designed  to  handle  multiple  memory  requests  in  the  background  without  affecting  the  compu¬ 
tations  on  the  SPU,  thus  parallelizing  memory  operations  with  numeric  computation.  The  SPEs, 
PPE,  and  the  main  memory  controller  are  connected  via  an  on-chip  interconnect  called  the  ele¬ 
ment  interconnect  hus  (EIB),  whic  is  ring-hased  in  current  instantiations. 

Performance  characteristics.  We  consider  two  versions  of  the  Cell  architecture:  the  origi¬ 
nal  Cell  BE,  and  the  newer  PowerXCell  8i.  Both  include  a  single  PPE  and  8  SPEs  with  256kB  local 
stores.  Each  SPE  can  retire  one  4-way  vectorized  single  precision  EMA  (fused  multiply-add)  float¬ 
ing  point  instruction  per  cycle.  Therefore,  up  to  8  single-precision  operations  can  he  performed 
per  cycle.  At  a  clock  rate  of  3.2GHz,  this  translates  into  a  peak  performance  of  25.6  Gflop/s  per 
SPE,  or  204.8  Gflop  /  s  across  all  8  SPEs.  The  Cell  BE  provides  limited,  non-pipelined  support  for  2  - 
way  vectorized  double  precision  computation,  thus  restricted  to  a  low  peak  performance  of  12.8 
Gflop/s.  The  PowerXCell  8i  does  not  have  this  limitation  and  is  capable  of  a  double-precision 
peak  performance  of  102.4  Gflop/s. 

The  MFC  has  a  DMA  queue  that  can  service  up  to  16  outstanding  DMA  requests.  It  can  also 
work  on  “DMA-list”  requests:  the  SPU  can  submit  a  list  of  up  to  2, 048  DMA  requests  by  storing 
them  in  an  array  in  local  memory,  which  the  MFC  can  then  process  in  the  background. 

The  ring-based  EIB  contains  4  stepwise  unidirectional  channels,  providing  an  aggregate  band¬ 
width  of  204.8  GB/s.  Each  channel  that  emerges  from  the  ring  connects  to  either  an  SPE,  or  the 
PPE,  or  the  memory  or  I/O  controllers,  and  can  handle  up  to  25.6  GB/s  of  traffic.  Rambus  XDR 
memory  is  used  for  the  main  memory,  whose  address  space  is  striped  across  16  banks. 

Programming  model.  The  Cell  can  be  programmed  using  the  pthreads  library.  Main  mem¬ 
ory  is  accessed  by  the  PPE  and  SPE  threads  using  the  shared  virtual  memory  address  space.  The 
memory  addressing  scheme  also  allows  each  SPE  to  provide  other  SPEs  with  access  to  its  local 
store.  The  Cell  architecture  uses  a  DMA  (direct  memory  access)  based  memory  access  model. 
It  requires  the  programmer  to  explicitly  orchestrate  all  memory  and  inter-core  data  movement 
operations.  The  Cell’s  explicit  memory  access  programming  model  allows  for  finer  control  of  the 
memory  system  compared  to  a  conventional  cache-based  processor,  and  allows  for  potentially 
better  exploitation  of  memory  bandwidth.  However,  it  also  drastically  increases  the  program¬ 
ming  burden  and  expertise  required  by  the  software  developer.  An  automatic  code  generation 
and  optimization  system  is  thus  particularly  valuable  for  such  an  architecture. 

To  automatically  generate  high-performance  code  for  the  Cell  we  must  address  three  archi¬ 
tectural  characteristics:  1)  within  an  SPE  we  need  to  produce  SIMD  vector  code,  2)  executing 
code  across  multiple  SPEs  requires  parallelization,  and  3)  hiding  the  latency  of  data  transfer  be¬ 
tween  local  stores  and  the  main  memory  requires  streaming,  which  is  a  technique  similar  to 
software  pipelining. 
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The  use  of  several  architectural  paradigms  and  the  programming  complexity  due  to  the  need 
for  explicit  memory  orchestration  makes  the  Cell  a  challenging  and  therefore  an  excellent  choice 
of  platform  for  evaluating  the  work  in  this  thesis.  The  Cell  is  also  one  of  the  first  single-chip 
platforms  to  he  designed  with  the  distributed  memory  paradigm,  which  has  traditionally  been 
reserved  to  cluster-based  supercomputing  platforms.  As  the  number  of  cores  on  a  chip  scales  in 
the  future,  due  to  the  scaling  limitations  of  shared  on-chip  cache  designs,  the  distributed  mem¬ 
ory  paradigm  is  likely  to  find  its  way  into  an  increasing  fraction  of  chip  based  multiprocessors. 
In  this  sense,  the  Cell  is  a  likely  representative  of  a  class  of  future  platforms. 

3.1.3  Cluster:  Axon 

Distributed  memory  computing  architectures  are  traditionally  associated  with  clusters  comput¬ 
ers.  A  cluster  is  essentially  a  group  of  linked  computers  that  work  together  as  one  system.  We 
primarily  consider  “compute  clusters,’’  by  which  we  mean  clusters  used  for  scientific  comput¬ 
ing,  simulations,  and  other  applications  involving  intensive  numerical  computations.  Compute 
clusters  are  typically  tightly  coupled,  which  means  their  design  includes  a  low  latency,  high  per¬ 
formance  dedicated  interconnect  to  handle  frequent  communications  between  nodes.  Clusters 
are  primarily  defined  by  the  number  and  type  of  PEs,  and  the  topology  and  type  of  the  intercon¬ 
nect. 

Introduction  and  architecture.  The  Axon  is  a  supercomputer  with  256  Intel  Xeon  (Intel’s 
Core  architecture)  based  PEs  and  an  InfiniBand  network  with  a  tree  topology.  As  shown  in  Eig- 
ure  3.3,  the  Axon  consists  of  32  nodes  connected  via  two  InfiniBand  switches,  with  one  mother¬ 
board  per  node,  two  chips  per  motherboard,  and  4  cores  per  chip,  for  a  total  of  256  PEs. 

Performance  characteristics.  Each  PE  on  the  Axon  provides  a  peak  performance  of  3  Gflop  /  s 
(double  precision).  With  256  PEs,  the  double  precision  floating  point  peak  performance  is  thus 
768  Gflop/s. 

The  interconnect  performance  varies  depending  on  the  level  as  Eigure  3.3  shows.  At  the 
top  most  level,  each  node  is  connected  to  15  other  units  through  an  InfiniBand  4x  SDR  switch, 
with  each  link  capable  of  a  bandwidth  of  8  Gb/s.  The  two  switches  are  connected  via  4  similar 
links.  The  two  GPUs  on  each  motherboard  are  connected  via  the  motherboard’s  fast  intercon¬ 
nect,  while  the  cores  within  each  GPU  use  the  faster  on-chip  interconnect. 

Gommunication  tasks  must  also  be  handled  by  the  computing  cores,  i.e.,  the  Axon  has  no 
dedicated  cores  or  other  processing  units  exclusively  for  handling  communications  in  the  back¬ 
ground. 

Programming  model.  The  Axon  can  be  programmed  using  an  MPI  [Forum,  1998a]  library, 
which  is  typical  for  supercomputing  clusters.  The  address  space  is  distributed. 
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Figure  3.3:  Architecture  of  the  Axon  cluster  supercomputer. 

3. 1 .4  Cluster:  Cray  XT4  and  XT5 

Introduction  and  architecture.  The  Cray  XT4  and  its  upgraded  version,  the  XT5,  are  commer¬ 
cially  used  supercomputing  platforms.  They  use  AMD  Opteron  processors,  and  Cray  SeaStarZ 
or  SeaStar2-i-  interconnects.  Our  XT4  evaluation  platform  provides  16  4-core  compute  nodes  (64 
PEs),  and  our  XT5  evaluation  platform  provides  16  8-core  compute  nodes  (128  PEs). 

Performance  characteristics.  Each  XT4  node  contains  an  AMD  Opteron  QuadCore  proces¬ 
sor  clocked  at  2.2GHz  (64  PEs  at  8.8  Gflop/s  each),  and  each  XT5  node  contains  two  AMD  Quad- 
Core  Shanghai  2.7GHz  processors  (128  PEs  at  10.8  Gflop/s  each). 

The  SeaStar2  interconnect  on  the  XT4  provides  a  6.4  GB/s  connection  between  the  proces¬ 
sors  in  a  node,  and  7.6  GB/s  links  to  she  neighboring  PEs  in  a  3D  Torus  topology.  The  SeaStar2+ 
on  the  XT5  is  similar,  except  that  it  offers  9.6  GB/s  of  intra-node  bandwidth. 

Programming  model.  The  Gray  XT4  and  XT5  systems  can  be  programmed  using  an  MPI  [Fo¬ 
rum,  1998a]  library.  The  address  space  is  distributed. 

3.2  Framework  for  Adaptation  Through  Formula  Rewriting 

We  have  thus  far  examined  features  of  our  target  architecture.  In  this  section,  we  present  the 
framework  we  use  for  adapting  the  dataflow  of  FFTs  to  our  target  architecture  through  formula 
manipulation  and  rewriting.  We  will  use  this  framework  in  the  rest  of  this  chapter  to  define  the 
parallelization  and  streaming  paradigms.  The  basic  approach  that  we  use  is  similar  to  previous 
work  on  mapping  FFTs  to  SIMD  vector  architectures  [Franchetti  et  al.,  2006b]  and  cache-based 
SMP  multicore  processors  [Franchetti  et  al.,  2006a]. 
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Cray  XT4  architecture: 

16  nodes  (64  PEs) 

1  motherboard  per  node 
1  QuadCore  chip  per  motherboard 
2.2  GHz  per  core 
8.8  Gflop/s  per  core 


Figure  3.4:  Architecture  of  the  Cray  XT4  supercomputer. 


Cray  XT5  architecture: 

16  nodes  (128  PEs) 

1  motherboard  per  node 

2  sockets  per  motherboard 
1  QuadCore  chip  per  socket 
2.7  GHz  per  core 

10.8  Gflop/s  per  core 


Figure  3.5:  Architecture  of  the  Cray  XT5  supercomputer. 


Architectural  paradigms:  parallelization  and  streaming.  A  paradigm,  as  defined  in  Chap¬ 
ter  2,  is  an  architectural  feature  that  is  the  target  for  structural  optimizations  of  the  FFT.  Our  tar¬ 
get  architectures  may  include  several  features  including  cache  based  memory  hierarchies,  SIMD 
vectorization,  multiple  forms  of  parallelism,  and  support  for  explicit  background  memory  man¬ 
agement.  Previous  work  covered  code  generation  for  cache  based  systems  and  for  SIMD  vector 
units,  but  did  not  address  the  distributed  memory  parallelism  and  memory  streaming,  which  is 
the  focus  of  this  thesis.  In  most  cases,  the  previous  work  composes  well  with  our  work,  and  is 
reused.  Before  we  discuss  parallelization  and  streaming  in  detail,  we  sketch  the  basic  idea. 

Basic  idea:  rewriting.  We  consider  the  Cooley- Tukey  FFT  (2.12)  for  ID  DFTs,  and  the  row- 
column  FFT  (2.15)  for  2D  DFTs.  Both  inherently  contain  data  parallelism  represented  by  the 
occurring  tensor  products  of  the  form  (/  ^  A)  and  [A  <&  I)  as  we  explained  in  Section  2.4.2.  How¬ 
ever,  the  data  parallelism  as  is  may  not  be  suited  for  an  efficient  implementation  on  our  target 
platform.  To  solve  this  problem,  we  will  formally  manipulate  the  FFT  to  obtain  the  appropri¬ 
ate  structure.  The  design  of  the  rewriting  system  depends  on  the  paradigm  and  is  done  in  three 
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steps.  We  explain  these  steps  using  a  simplified  view  of  parallelism  as  an  example: 


Step  1:  Architecture  abstraction  and  specification.  We  determine  and  abstract  the  main  architec¬ 
tural  feature  that  is  the  target  of  our  optimization,  and  identify  relevant  parameters.  For 
parallelization,  this  may  be  the  number  of  processors,  p.  The  paradigm  is  then  specified 
by  its  name  and  the  parameters  (par(p)  in  our  example).  We  tag  an  SPL  expression  indicate 
that  we  want  to  adapt  the  expression  to  that  paradigm.  For  example,  to  parallelize  A  over 
p  PEs,  we  write: 


par(p) 

Step  2:  Identification  of  mappable  constructs.  We  determine  the  set  of  SPL  constructs  whose 
dataflows  map  naturally  well  to  the  paradigm.  These  constructs  can  then  be  directly  trans¬ 
lated  into  corresponding  C  code.  Our  goal  then,  is  to  rewrite  our  initial  FFT  algorithm  to 
be  composed  only  of  these  mappable  constructs.  Mappable  constructs  are  written  using 
special  tensor  products  to  identify  them. 

For  example,  as  mentioned  in  Section  2.4.2,  constructs  of  the  form  Ip  iSi  Am  naturally  map 
to  p  PEs.  In  this  case,  we  represent  the  tensor  product  using  the  i8>||  symbol: 

Ip  (S>||  Am  (3.1) 

Step  3:  Rewriting  via  formula  manipulation.  Our  goal  is  to  rewrite  a  given  SPL  expression  into  a 
form  that  is  directly  mappable  in  the  sense  discussed  above.  To  do  so,  we  determine  rewrite 
rules  that  convert  SPL  expressions  into  mappable  expressions.  We  remove  our  specifica¬ 
tion  tags  (shown  in  step  1)  once  the  rewriting  is  complete,  and  we  have  arrived  at  an  algo¬ 
rithm  consisting  of  mappable  constructs  (shown  in  step  2).  An  example  of  a  rewrite  rule 
is: 

In  ^  i^nlp  ^  (3.2) 

pai(p,p) 

The  rule  parallelizes  [In  ®  Am)  for  p  processors.  Note  how  the  right-hand  side  of  (3.2) 
matches  (3.1). 

Next,  we  expand  this  framework  to  the  two  main  architectural  paradigms  in  this  thesis:  par¬ 
allelization  and  streaming. 
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3.3  Parallelization 

This  section  presents  the  parallel  paradigm.  We  target  “coarse-grained”  platform  parallelism 
available  in  the  form  of  multiple  PEs.  Load  balancing  the  computation  is  our  primary  goal,  which 
is  necessary  to  achieve  good  performance.  We  include  explicit  memory  movement,  which  is  typ¬ 
ically  required  by  distributed  memory  platforms.  Our  methods  expand  the  prior  work  in  [Bonelli 
et  al.,  2006;  Franchetti  et  al.,  2006a]. 

3.3.1  Architecture  Abstraction  and  Specification 

Our  main  target  architectural  component  is  parallelism.  We  define  a  new  tag,  called  par()  to 
specify  that  we  want  to  parallelize  an  SPL  expression  To  generate  high  performance  parallelized 
FFT  code,  we  have  the  following  requirements: 

•  Load  balancing.  The  computation  load  should  be  divided  equally  among  all  PEs,  and  se¬ 
quential  code  should  be  minimized  to  improve  speed-up  (Amdahl’s  law).  We  specify  the 
number  of  PEs  using  the  tag  parameter  p. 

•  Explicit  distributed  memory  access.  Since  our  data  is  distributed  among  multiple  PEs, 
generated  code  must  include  explicit  memory  instructions  (message  passing  or  DMA)  to 
transfer  data  between  PEs.  Input  and  output  data  is  assumed  to  be  distributed  in  a  block 
fashion  among  the  PEs.  Non-standard  initial  and  final  data  distributions  are  discussed  in 
Section  4.2.2. 

•  Size  of  communication  packets.  We  design  our  algorithm  to  produce  data  exchanges, 
when  needed,  in  packets  of  a  size  that  is  optimal  for  our  target  architecture,  to  minimize 
communication  costs.  Typically,  this  means  producing  large  packet  sizes.  The  optimal 
packet  size  can  be  determined  either  using  architecture  specifications,  or  by  a  simple  ex¬ 
periment  that  measures  achieved  bandwidth  for  a  range  of  packet  sizes.  For  most  archi¬ 
tectures,  achieved  bandwidth  increases  with  increasing  packet  sizes  until  a  certain  point, 
which  is  the  architecture’s  optimal  packet  size.^  We  specify  our  packet  size  using  the  tag 
parameter  p. 

•  Computation/communication  overlap.  Optionally,  if  the  architecture  supports  it,  we  aim 
to  produce  algorithms  that  are  capable  of  overlapping  computation  with  communication 
to  minimize  communication  cost. 

^Note  that  the  transform  size  determines  the  maximum  possible  packet  size 
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•  Minimizing  barriers.  We  minimize  the  number  of  global  synchronization  points  or  barri¬ 
ers  to  maximize  performance.  Optionally,  when  supported  by  the  architecture,  we  produce 
code  that  does  not  use  global  barriers,  but  instead  uses  cheaper  local  barriers. 


3.3.2  Identifying  Mappable  Constructs 

We  now  identify  constructs  that  map  naturally  well  to  parallel  architectures,  keeping  in  mind  the 
goals  we  identified  above. 

Load  balancing.  As  explained  in  Section  2.4.2,  constructs  of  the  form  Ip  is>  Am  are  embarrass¬ 
ingly  parallel,  map  naturally  to  p  PEs.  As  in  [Franchetti  et  al.,  2006a],  we  use  a  tagged  version  of 
the  tensor  product,  <Si||,  to  signify  a  “parallel  loop”: 

Ip  <8i||  A. 

As  illustrated,  we  show  the  C  pseudo-code  equivalent  of  the  SPL  expression  y  =  {Ip  iS>||  Am)x. 
We  assume  a  distributed  memory  model  that  is  programmed  using  an  MPI  library,  and  assume 
that  the  input  and  output  vectors  are  divided  into  p  blocks  of  size  m  each.  As  an  example,  the 
contiguous  elements  of  the  x  vector  (shown  in  the  SPL  expression  above)  beginning  at  the  m  * 
element  and  ending  at  the  (m  *  {i  +  1))  -  1*^  element  are  resident  in  array  X  of  processor  i: 

void  A(double  *out  ,  double  *iii)  { 

//  Code  that  applies  A  to  in  to  produce  out 

> 

int  mainCint  argc  ,  char  ’t'^argy)  { 
double  +Y, 

//  Y:  portion  of  the  y  vector  assigned  to  this  PE 
//  X:  portion  of  the  x  vector  assigned  to  this  PE 

MPI_Init (feargc ,  &argv); 

MPI_Comm_rank (MPI_C0MM_W0RLD ,  &p) ; 

//  Allocate  and  initialize  X,  Y  here 

A(X,  Y);  //  Computes  (globally ) :  y  =  (Ip  x  A) 

MPI_Barrier (MPI_C0MM_W0RLD) ; 

MPI_Finalize () ; 

> 


As  seen  in  the  code  above,  each  PE  simply  applies  A  to  its  chunk  of  the  input  vector. 

Explicit  memory  access.  Explicit  memory  accesses  are  difficult  to  express  in  SPL,  and  are 
better  expressed  at  the  lower  level  using  Z-SPL.  We  introduce  new  constructs  in  Z-SPL  to  ex- 
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press  explicit  memory  access,  and  provide  an  an  example  of  converting  SPL  code  into  Z-SPL  in 
Section  3.3.3. 

Size  of  communication  packets.  Permutations  of  the  form  (where  P  is  any  permuta¬ 
tion)  access  memory  in  chunks  of  size  p.  We  use  a  tagged  tensor  product  ®  to  recognize  these 
constructs; 


P®I^. 

This  means  they  are  in  their  final  form  required  for  accessing  memory  in  packets  of  size  p. 
Minimizing  harriers.  Synchronization  barriers  must  occur  only  when  a  PE  uses  data  that  has 
been  previously  processed  by  another  PE.  We  discuss  this  in  the  context  of  building  complete  EFT 
algorithms  in  Chapter  4. 

Vectorization.  Our  rewrite  rules  should  be  compatible  with  the  previous  work  on  SIMD  vec- 
torization  [Franchetti  et  al.,  2006b]. 

3.3.3  Rewriting  via  Formula  Manipulation 

Constructs  of  the  form  In®  A  are  easily  parallelized  when  p\n^  using  perform  simple  loop  split¬ 
ting.  As  a  rewriting  rule,  this  is  expressed  as: 

In  ®  Ayn  *  Ip  ‘S'li  [In/p  ®  Ani)j  THfl! p  ^  fl  (3.3) 

par(p,p) 

Note  that  the  rule  also  requires  that  the  block  size  mn!  pis,  effectively  the  packet  size,  which 
must  be  at  least  as  large  as  the  requested  packet  size  p. 

In  (3.3),  iterations  in  the  parallel  loop  [Ip  iSi||  Am)  are  executed  simultaneously  on  p  PEs.  The 
input  vector  is  divided  into  p  blocks  of  size  mnip  each,  and  each  block  is  read  in  by  a  PE  in 
packets  of  size  p.  The  largest  possible  packet  size  is  hence  p  =  mn/p.  We  show  the  corresponding 
MPI  code  below; 

void  A(double  *out  ,  double  *iii)  { 

//  Code  that  applies  A  to  in  to  produce  out 

} 

int  mainCint  argc ,  char  **argv)  { 
double  *Y, 

MPI_Init (feargc ,  &argv); 

MPI_Comm_rank (MPI_C0MM_W0RLD ,  &p) ; 

^In  this  thesis,  we  only  target  sizes  where  p\n  (where  p  is  the  number  of  PEs).  Extension  to  cases  were  p  does  not 
divide  n  is  accomplished  by  executing  the  left-over  parallel  iterations  on  a  subset  of  the  PEs 
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//  Allocate  and  initialize  X  here 

//  Loop  performs  (I(n/p)  x  Am)  on  this  node: 
for (i=0; i<n/p; i++) 

A(Y+(i+m) ,  X+(i*m)) ; 

MPI_Barrier (MPI_C0MM_W0RLD) ; 

MPI_Finalize  () ; 


Next,  we  rewrite  the  construct  Am  ^  I„,  which  can  he  converted  into  our  parallel  mappahle 
form  using  (2.8)  and  (3.3): 


Unlp^Ammir^h) 

pai(p,p)  pai(p,p)  par(p,p)  parip,p) 


However,  this  leads  to  packet  sizes  of  p  =  1  for  and  in  the  equation  above. 

This  is  because  each  PE  works  on  a  single  kernel  during  each  iteration  of  the  loop  over  n/p.  As 
an  example,  we  first  provide  a  formula,  followed  by  its  dataflow  graph: 

F2^h^  iLl^h)il2  ®||  il2»F2))iLl^h) 

paY(p,p) 


In  the  dataflow  graph  below,  data  points  and  kernels  in  the  dataflow  graph  assigned  to  the 
first  PE  are  shown  in  black,  while  those  assigned  to  the  second  PE  are  shown  in  red: 
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As  seen  in  the  right  side  dataflow  graph,  in  its  first  iteration,  PE  0  first  loads  xq  and  X4,  com¬ 
putes  a  single  F2  kernel,  and  stores  the  results  to  yo  and  3/4.  Then,  in  its  second  iteration,  PE  0 
loads  Xi  and  X5,  computes  a  single  F2  kernel,  and  stores  the  results  to  yi  and  3/5. 

To  increase  the  packet  size,  each  PE  can  load  and  store  data  for  multiple  successive  kernels 
(at  least  p  many)  jointly.  This  can  he  done  in  several  ways.  We  use  the  formula  helow  which  is 
one  way  to  capture  the  partitioning  of  the  dataflow  graph  such  that  p  successive  data  elements 
in  memory  are  always  loaded  or  stored  at  the  same  time: 

(I™^<ii/^){/pi8)||  (AmiSi/^))(L™^ii)/^),p=  n/p 

par(p,^) 

Note  that  our  load  and  store  packet  sizes  appear  as  p  in  the  formula  above.  Using  the  same 
example  as  earlier,  we  now  parallelize  F2  ^  h  to  load  and  store  in  packets  of  size  2: 

F2  ®  /4  -  [Lj^hUh  ®||  iF2  ®  h))iLt^h) 

par(p,/;) 

As  illustrated  in  the  following  dataflow  graph  (first  processor  in  hlack,  second  processor  in  red), 
PE  0  now  loads  both  xq  and  its  neighbor  xi  together,  and  also  X4  and  X5  (packet  size  in  each 
case  is  therefore  now  2).  It  then  computes  two  F2  kernels  before  storing  the  results  again  using  2 
packets  of  size  2  each. 
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Since  our  SPL  construct  now  requires  explicit  memory  access,  we  first  convert  it  to  Z-SPL  to 
extract  the  required  memory  accesses  before  generating  C  code. 


Explicit  memory  access.  As  explained  in  Section  2.4.3,  Z-SPL  distinguishes  explicitly  be¬ 
tween  skeletons  (compute  kernels)  and  decorations  (loads  and  stores,  including  permutations). 
We  extend  this  framework  to  mark  up  scatters  and  gathers  in  Z-SPL  that  represent  explicit  data 
transfers  instead  of  (implicit)  loads  and  stores.  We  denote  a  sum  to  be  parallelized  such  that  ev¬ 
ery  iteration  runs  on  a  different  PE  by  .  We  introduce  a  specially  marked  gather  Gdt(<7)  and 
scatter  Sdt(^7)  (DT  stands  for  “data  transfer”)  to  express  explicit  communication  to  access  non¬ 
local  data.  These  are  similar  to  the  scatter  and  gather  matrices  defined  in  Section  2.4.3,  except  for 
two  differences:  a)  we  use  the  data  transfer  scatter  and  gather  only  with  the  mapping  function 
(2.23),  which  means  that  these  scatter  or  gather  data  in  specified  packet  sizes,  and  b)  these  get 
converted  into  target-specific  explicit  memory  transfer  instructions  like  DMA  or  MPl  code. 


To  illustrate  the  process  of  generating  explicit  memory  accesses  using  Z-SPL,  we  provide  an 
example  below: 
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Am®  In-  ®||  {Am  ®  Ip)){LZ^®Ip) 


par(p,^) 


^  ‘^DxC^fc.n/p.p)  ^{^j,p)AmG{hjp)  GDT(^fc,«/p,p)> 


p-1 


k=0 


IJ,=  nip  (3.4) 


The  second  step  is  a  standard  SPL  to  Z-SPL  conversion  except  for  the  first  and  last  SPL  factors. 
These  are  permutations,  and  are  converted  into  explicit  memory  scatters  and  gathers  in  the  outer 
parallel  sum.  Also,  note  how  the  explicit  memory  scatter  and  gather  access  memory  in  chunks 
of  size  p.  The  explicit  memory  scatter  and  gather  operations  are  easily  converted  into  code.  We 
provide  an  example  below  where  we  convert  the  following  scatter  into  DMA-based  code; 


p-i 

X  'SDT(<7fc,2,4) 
k=0 


void  DMA_SCATTER ( sour ce * ,  dost*,  pk_size) 

{  spu_mf cdma64 ( source ,  dest ,  size*8,  1,  MFC_PUT_CMD) 
} 

/*  Code  runs  in  parallel  on  2  processing  elements  */ 
int  main () 

{  for(i;=0;  ±<=7;  i++)  //  Left  most  scatter 

{  DMA_SCATTER ( scat_f unc (X , i ) ,  scat_f unc (Y , i) ,  4) 


> 


} 


The  other  remaining  basic  SPL  constructs  are  rewritten  to  produce  explicit  memory  accesses 
similarly,  and  summarized  in  Table  3.1. 

Increasing  communication  packet  size.  Most  interconnects  perform  poorly  when  trans¬ 
ferring  small  data  packets,  since  the  per-packet  overhead  may  not  amortize,  or  the  underlying 
hardware  may  not  be  utilized  effectively  for  packet  sizes  below  a  threshold.  This  is  illustrated 
in  the  simple  experiment  in  Figure  3.6,  which  shows  the  achieved  interconnect  bandwidth  as  a 
function  of  packet  size  for  the  same  amount  of  data  transferred. 

Unfortunately,  the  data  exchange  packet  sizes  in  the  parallelization  approach  shown  so  far 
do  not  scale  well  with  the  number  of  PEs.  This  is  because  the  number  of  available  iterations 
of  the  kernel  (given  the  dimension  of  the  identity  matrix)  is  split  between  the  packet  size  and 
number  of  PEs  (in  other  words,  n  =  pp).  For  a  given  n,  as  p  increases,  p  decreases.  Since  the 
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Cell:  Main  Memory  Bandwidth 

Measured  bandwidth  [normalized  to  maximum] 


Figure  3.6:  Cell  Main  memory  bandwidth  measured  for  a  single  read  and  write  performed  on  2^®  bytes  of 
contiguous  data  at  various  packet  sizes.  Measured  bandwidth  is  normalized  to  the  theoretical  maximum 
bandwidth  of  25.6  GB/s. 


CT-FFT  algorithm  consists  of  two  factors,  the  minimum  packet  size  scales  as  a  square  root  of  the 
size  of  DFT,  as  shown  in  Figure  3.7. 

With  the  previous  parallelization  method  each  PE  sends  multiple  packets  to  every  other  PE. 
[Bonelli  et  al.,  2006]  and  [Chen  and  Johnson,  2004]  present  methods  that  factorize  the  stride  per¬ 
mutations  into  local  and  global  parts;  each  processor  pt  first  locally  collects  all  packets  to  be 
sent  to  processor  pj,  assembles  them,  and  sends  a  single  huge  packet  to  pj,  which  disassem¬ 
bles  them  upon  receipt.  The  cost  of  the  local  assembly  and  disassembly  are  largely  hidden  by 
being  implemented  as  readdressing  inside  the  computation  kernels,  using  loop  merging  tech¬ 
niques  developed  in  [Franchetti  et  al.,  2005].  The  factorization  is  shown  below.  We  use  the  tag 
to  indicate  parallelization  of  A  using  the  large  packet  algorithm  (note  that  this  tag  is  not 

parBig(p) 

parameterized  by  p,  because  it  is  meant  to  parallelize  with  the  largest  packet  size  possible): 

ir  (3.5) 

parBig(p) 

Using  this  factorization,  we  can  convert,  for  example.  Am  <s>  !„  into  its  parallelized  equivalent: 
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Packet  Size  Scaling:  Basic  Parallelization 


Problem  size 


Figure  3.7:  Packet  size  scaling  using  the  basic  parallelization  algorithm.  Each  line  corresponds  to  a  speci- 

fied  number  of  PEs.  The  minimum  packet  size  p  is  given  by  2^^  where  N  is  the  problem  size,  and 

p  is  the  number  of  PEs. 


parBig(p)  parBig(p)  parBig(p)  parBig(p) 

-{/p  ®||  ®||  (ip  ®  Imp)) 

[Ip  (S)||  {Ifi/p  ®  ^m)) 

iIp®\\lZ%^)iL^p  ^Imnlp^)ilp^\\  ^Lp^Imlp))  (3.6) 

As  an  additional  benefit,  because  each  processor  sends  exactly  one  packet  to  each  other  pro¬ 
cessor  during  an  all-to-all  exchange,  this  method  can  use  the  nipi_alltoall  call  (or  its  equiva¬ 
lent)  when  available,  which  is  typically  optimized  for  a  given  platform  based  on  its  topology  and 
interconnect  characteristics.  In  this  thesis,  we  assume  the  nipi_alltoall  call  is  the  fastest  way 
to  perform  an  all-to-all  exchange,  and  do  not  explore  alternatives,  as  done  in  [Chen  and  Johnson, 
2004]  or  FFTW  [Frigo  and  Johnson,  2005]. 

When  using  the  factorization  shown  in  (3.6),  the  packet  size  in  the  data  exchanges  is  always 
performedatAJ/P^,  where  N  is  the  size  of  the  transform  [N  -  mn  in  (3.6)).  This  scales  better  than 
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the  previous  method,  as  shown  in  Figure  3.8.  Note  that  because  the  packet  size  is  always  fixed  at 
N/P^,  the  parallelization  tag  for  this  methods  does  not  contain  a  packet  size  specification. 


Packet  Size  Scaling:  Large  Packet  Parallelization 


Figure  3.8:  Packet  size  scaling  using  the  large  packet  algorithm.  Each  line  corresponds  to  a  certain  num¬ 
ber  of  processors.  The  minimum  packet  size  /i  is  given  hy  Nlp^,  where  N  is  the  problem  size,  and  p  is  the 
number  of  PEs. 


Vectorization.  However,  the  method  above,  as  taken  from  [Bonelli  et  al.,  2006],  does  not 
lend  itself  easily  to  SIMD  vectorization  of  the  inner  kernel.  This  is  because  restructuring  the 
formula  above  for  vectorization  would  cause  the  introduction  of  additional  permutations  which 
are  superfluous  and  expensive.  Nofe  fhat  2-way  vectorization  can  be  performed  using  the  fact 
that  the  DFT  involves  complex  operations.  For  architectures  that  use  4-way  vectorization  or 
higher  (either  current  single  precision  architectures,  or  future  architectures  like  the  Larrabbe 
which  are  expected  to  provide  4-way  vectorization  support  for  double-precision  computation), 
we  develop  a  variant  that  can  be  vectorized: 


[Am  ®  In)  -  [LZll”  ®  Iv)  Unlv  ^  Am  ^  h)  ' 


parBig(p) 


parBig(p) 


parBig(p) 


parBig(p) 


®ll  L[Z%yp’'  ^  ®  ^rnnlpAUp  ®||  l”  ®  Imlp) 


(3.7) 

(3.8) 


{Ip  <^||  Inipu  ^  ®  (3.9) 

^11  ■f'Sr.v'p  ^  ®  ®  ImnlpAilp  ®||  Tp  ^  Inip)  (3.10) 
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The  factors  in  (3.10)  are  now  compatible  with  vectorizable  constructs  as  presented  in  [Franchetti 
et  al.,  2006b]. 

Degree  of  freedom  in  factorizing  the  stride  permutation.  The  factorization  shown  in  (3.5) 
has  a  degree  of  freedom  that  can  be  exploited  in  certain  situations.  If  m  =  n,  The  stride  permu¬ 
tation  can  also  be  factorized  as: 


jmn  ffTmn\-l\-l  _  _  rrmn^T 


-  [{Ip  ®||  I7/)(L^'  ®  Imnlp^)ilp  ®ll  ip  ®  Inlp)Y 

-  {Ip  ®||  ®  /„/p)T(LP'  ^  Irr^nlp^fiIp  ®ll  Cp 


Therefore, 


^  -(/p^iiL™p®/„/p)(L^'®/„„/p2)(/p®i|i;;;;^p).  (3.11) 

parBig(p) 

We  will  use  the  alternative  factorization  in  (3.11)  to  our  advantage  when  developing  DFT 
algorithms  in  Chapter  4. 

3.3.4  Summary 

In  summary,  our  parallel  paradigm  optimizes  the  structure  of  tensor  products  for  load  balanced 
execution  across  multiple  PEs,  with  inter-processor  data  exchanges  being  made  in  specified 
packet  sizes.  Our  basic  parallel  tag  is  parameterized  by  par(p,  p),  where  p  is  the  number  of  PEs 
and  p,  the  minimum  packet  size.  We  rewrite  all  SPL  constructs  into  the  two  basic  mappable  con¬ 
structs  Ip  i8>||  Am  (load  balanced  execution),  and  P^Ip  (data  reads/writes  with  a  specified  size). 
Our  large  packet  parallelization  tag  is  parameterized  by  parBig(p),  where  p  is  the  number  of  PEs. 

We  rewrite  all  SPL  constructs  into  the  two  basic  mappable  constructs  Ip  iSi  n  Am  (load  balanced  ex- 

2 

ecution),  and  Lp  is>  Inip2  (data  exchanges  with  size  n/p^  where  n  is  the  problem  size.  Our  rewrite 
rules  are  summarized  in  Table  3.1. 


3.4  Streaming 

This  section  presents  the  streaming  paradigm.  Streaming,  also  known  as  double-buffering  or 
multi-buffering  ^  is  a  technique  to  reduce  or  eliminate  the  cost  of  data  movement  by  overlapping 

^The  name  arises  from  the  fact  that  multiple  buffers  must  he  maintained  to  implement  the  technique. 


3.4.  Streaming 


59 


■  A  barrier  B 

par(p,fi)  par(p,p) 

■  Ip  0||  [Inip  ®  ^m) 

,  mp 


[Lm  ®  In/p^)[^p  ^^n/p) 

paiip,p) 


par(p,p) 

iPe>I„)- 

par(p,p) 

par(p,p) 

par(p,p) 


"mip 

par(p,p) 

.  pm 


par(p,p) 
mnlp\ 


par(p,p)  par(p,p) 

[Psi  I„ip]»Ip 


P-1 

•011^1 

1=0 


p-1  jnlp-l  ' 

A:=0  \  7=0  / 

p-1  ' 

A:=0  V7=0  / 


-  A  barrier  B 
parBigCp)  parBig(p) 

'  ®ll  i^n/p  ®  ^m) 


r  mn 


■  ®ll  ^^mnlp^)^h  ®ll  (-^p  ®  Imlp)) 


r  mn 


•  ®ll  ^m/p  ®  Inlp)iLl  0  /^„/p2)(/p  0|| 


(3.12) 

(3.13) 

(3.14) 

(3.15) 

(3.16) 

(3.17) 

(3.18) 

(3.19) 

(3.20) 

(3.21) 

(3.22) 

(3.23) 


Table  3.1:  Parallel  paradigm  rewrite  rules.  P  is  any  permutation,  D,Di  are  diagonal  matrices.  The  tag 
par(p,p)  specifies  parallelization  across  p  PEs,  using  a  minimum  data  packet  size  of  p  for  inter-processor 
data  exchanges.  We  assume  nlp>  p  for  all  the  above.  The  tag  parBig(p)  specifics  parallelization  across  p 

p2 

PEs  using  our  large  packet  algorithm.  The  construct  [Lp  0  fm„/p2)  is  a  global  all  to  all  data  exchange  to  be 
implemented  appropriately  using  explicit  memory  access. 

All  the  basic  SPL  constructs  shown  in  Table  2.1  can  be  parallelized  using  these  rules.  For  the  basic  paral¬ 
lelization  algorithm,  note  that  rules  (3.13)-(3.17)  are  shown  in  SPL  to  aid  the  reader  in  understanding  the 
loop  transformations  that  occur.  In  reality,  our  system  converts  SPL  constructs  directly  into  Z-SPL  to  create 
parallel  loops  and  explicit  scatters  and  gathers  as  required,  as  shown  in  (3.18)-(3.19). 
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it  with  computation.  Streaming  requires  hardware  support. 

Data  movement:  background.  Data  access  in  computer  systems  is  typically  expensive:  an 
un-cached  load  from  memory  can  cost  hundreds  or  thousands  of  cycles,  while  a  general  purpose 
processor  can  typically  retire  several  floating  point  computations  in  each  cycle.  The  memory 
hierarchy  was  designed  to  alleviate  data  access  costs  in  uniprocessor  systems  where  data  must 
he  moved  between  the  processor  and  main  memory.  Spiral’s  initial  efforts  were  focused  on 
designing  recursive  FFT  algorithms  that  took  maximal  advantage  of  the  memory  hierarchy. 

FFT  algorithms  in  distributed  memory  systems  require  data  exchange  amongst  the  PEs.  Mov¬ 
ing  data  to  its  destination  quickly  and  before  it  is  needed  is  important,  and  may  significantly 
affect  performance.  Chip-based  distributed  memory  systems  like  the  Cell  processor  have  rela¬ 
tively  fast  on-chip  interconnects.  To  scale  well,  on-chip  interconnects  may  have  a  ring  (e.g..  Cell 
BE) ,  mesh  (e.g. ,  Tilera  TILE64) ,  or  other  scalable  topologies,  and  may  use  packet-based  switching 
networks.  Contention  on  these  topologies  may  impact  interconnect  performance.  Cluster  based 
distributed  memory  systems  use  inter-node  interconnects  like  InfiniBand  or  Ethernet,  where 
data  movement  costs  are  much  higher,  and  may  dominate  FFT  performance  on  these  systems. 

Streaming  on  modern  hardware.  To  alleviate  communication  costs,  modern  distributed 
memory  architectures  may  allow  programs  to  programs  to  set  up  explicit  data  transfer  requests 
(amongst  the  PEs  and  between  the  PEs  and  main  memory)  that  will  proceed  in  the  background, 
in  parallel  with  computation.  For  example,  the  Cell  BE  supports  background  DMA  communi¬ 
cations;  the  BlueGene/P  includes  processors  that  can  be  dedicated  to  servicing  communication 
requests  in  the  background.  The  MPI  standard  includes  non-blocking  MPI  calls  for  this  purpose. 
However,  designing  algorithms  and  writing  programs  to  take  advantage  of  streaming  may  not  be 
trivial.  We  use  our  streaming  paradigm  to  build  FFT  algorithms  that  take  maximum  advantage 
of  such  architectures. 

Our  main  goal  is  to  generate  code  that  overlaps  computation  with  communication  for  the 
four  basic  SPL  constructs.  We  use  our  rewriting  framework  presented  earlier  to  build  our  stream¬ 
ing  paradigm. 

3.4. 1  Architecture  Abstraction  and  Specification 

Each  of  the  basic  SPL  constructs  involves  a  loop  over  a  kernel.  The  computation  consists  of  the 
kernel  itself,  and  the  communication  consists  of  the  required  loads  from  the  input  vector  and 
stores  to  the  output  vector.  We  assume  the  source  and  destination  for  the  loads  and  stores  is 
main  memory,  although  it  could  also  be  other  PEs.  We  define  a  new  tag,  called  streainQ  to 
specify  that  we  want  to  adapt  an  SPL  expression  to  the  streaming  paradigm.  The  following  are 
our  requirements: 
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•  Overlap  of  computation  and  communication.  The  idea  of  streaming  is  similar  to  software 
pipelining,  and  involves  computing  the  i’th  iteration  of  tensor  product  loop  while  simul¬ 
taneously  storing  the  results  of  the  {i  -  1)*  iteration,  and  loading  data  for  processing  the 
[i  +  1)**^  iteration.  This  is  illustrated  in  the  following  dataflow  graph,  which  shows  iteration 
2  of  /4  iSi  F2  being  executed.  Writes  from  iteration  1  and  reads  for  iteration  3  are  performed 
simultaneously. 
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•  Explicit  memory  access.  Generated  code  must  include  explicit  memory  instructions  (mes¬ 
sage  passing,  DMA,  etc.),  similar  to  our  parallel  paradigm. 

•  Size  of  communication  packets.  Similar  to  the  parallel  paradigm,  we  must  design  our 
algorithm  to  produce  exchanges  of  data  in  packets  of  a  size  that  is  optimal  to  our  target 
architecture  to  minimize  communication  costs  and  maximize  performance.  Typically,  this 
means  producing  large  packet  sizes.  We  specify  our  packet  size  using  the  tag  parameter  y/.'^ 

3.4.2  Identifying  Mappable  Constructs 

We  now  identify  constructs  that  map  naturally  well  to  streaming  architectures,  keeping  in  mind 
the  goals  we  identified  above. 

The  trivial  case  is  In  Am.  Normally,  this  involves  computing  the  kernel  Am  n  times  on  the 
input  vector.  In  our  streamed  case,  we  tag  the  tensor  product  using  the  symbol  <81=  to  show  that 

^We  use  fi  to  denote  packet  sizes  used  for  inter- PE  data  transfers,  and  y/  to  denote  packet  sizes  used  for  data  trans¬ 
fers  between  PEs  and  main  memory. 


62 


Chapter  3.  The  Parallelization  and  Streaming  Paradigms 


we  interpret  it  (at  program  generation  time)  as  a  loop  where  we  compute  the  i’th  iteration  while 
simultaneously  writing  and  reading  the  vectors  for  the  previous  and  next  iterations.  Thus,  con¬ 
structs  of  the  form: 

Is  i8>=  Affi 

are  in  their  final  form  required  for  overlapped  computation  and  communication.  The  following 
pseudo-code,  written  with  a  DMA-hased  programming  paradigm  (for  an  architecture  such  as  the 
Cell  BE)  illustrates  how  this  loop  is  implemented: 

#define  SWAP(x,  y)  {double*  temp=x;  x=y;  y=temp;} 

/*  DM A _ G ET ( dest ,  source^  n)  is  a  non -blocking  DMA  call  to  get  n  data 
elements  from  source  (in  main  memory)  to  dest  (in  local  memory ) . 

DMA_PUT ( source ,  dest,  n)  is  a  non -blocking  DMA  call  to  put  n  data 
elements  from  source  (in  local  memory)  to  dest  (in  main  memory ) . 

*/ 

void  A(double  *y ,  double  *x)  {  //  Code  that  applies  A  to  x  to  produce  y  } 

void  transform (double  *Y,  double  *X)  { 

//  Header: 
i  =  0; 

DMA_GET (altBufX ,  X[i*m],  m) ; 

BL0CK_0N_DMA () ; 

SWAPdocalX,  altBufX); 

DMA_GET (altBufX ,  X[(i+l)*m],  m) ; 

A(localY[0],  localX[0]); 

BL0CK_0N_DMA  ()  ; 

//  Main  body 
for(i=l;  i<s-2;  i++)  { 

//  Start  non-blocking  write  of  previous  iteration’s  data 
SWAPClocalY,  altBufY); 

DMA_PUT (altBuf Y ,  Y[(i-l)*m],  m) ; 

//  Start  non-blocking  read  of  next  iteration’s  data 
SWAPdocalX.  altBufX); 

DMA_GET (altBuf X ,  X[(i+l)*m],  m) ; 

//  Compute  current  iteration 
A(localY[0],  localX[0]); 

BL0CK_0N_DMA () ; 


} 
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//  Footer:  (put  s-2,  compute  s-1,  put  s-1) 

SWAPClocalY,  altBufY); 

DMA_PUT (altBufY ,  Y [ ( s - 2) *m]  ,  m) ; 

SWAPdocalX,  altBufX); 

A(localY[0],  localX[0]); 

BL0CK_0N_DMA () ; 

SWAPClocalY,  altBufY); 

DMA_PUT (altBufY .  Y[(s-l)*m].  m) ; 

BLQCK_ON_DMA () ; 

} 

Minimum  iterations  required.  Note  that  to  implement  a  streamed  loop,  the  first  and  last 
iterations  cannot  involve  overlap,  as  they  are  the  setup  and  teardown  stages,  respectively.  Thus, 
we  need  a  minimum  of  3  iterations  in  any  streamed  loop,  giving  us  the  constraint  s  >  3.  When 
s  =  3,  only  the  middle  iteration  can  he  overlapped,  while  the  first  and  last  are  not,  which  means 
only  a  third  of  the  entire  involves  overlap.  Therefore,  in  practice,  we  need  more  iterations  for 
effective  hiding  of  memory  costs.  Streamed  loops  with  with  4  and  8  iterations  can  effectively 
overlap  at  most  50%  and  75%  of  its  iterations  respectively. 

Explicit  memory  access.  To  generate  code  with  explicit  memory  accesses,  we  use  the  same 
technique  as  discussed  earlier  in  Section  3.3.3.  We  only  state  the  differences  here.  First,  instead 
of  the  parallel  iterative  sum,  we  use  a  new  iterative  sum  specially  tagged  for  streaming,  the  ^  , 
which  corresponds  to  SPL’s  streaming  tensor  product  (the  0=).  Second,  when  necessary,  we  en¬ 
sure  that  the  scatters  and  gathers  for  explicit  memory  access  are  marked  to  communicate  with 
main  memory  rather  than  with  other  PEs. 

Size  of  communication  packets.  Permutations  of  the  form  access  memory  in  chunks 
of  size  Similar  to  our  parallelization  case,  we  use  a  tagged  tensor  product  <8i=,  and  recognize 
that  constructs  of  the  form: 

are  in  their  final  form  required  for  accessing  memory  in  packets  of  size  i//. 

3.4.3  Rewriting  via  Formula  Manipulation 

As  shown  above,  constructs  of  the  form  /„  ®  Am  constitute  our  naturally  mappahle  case,  and  are 
trivial  to  rewrite: 


In  ®  Am  Is  ilnls  I\.m) 
stream(i//) 


(3.24) 
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Here,  we  read  and  write  in  packets  of  size  mnis.  A  kernel  of  this  size  [mn!  s)  must  also  fit  on- 
chip  (in  the  local  store  or  scratchpad  or  cache).  Kernels  smaller  than  a  certain  size  (dependent 
on  the  architecture)  might  perform  poorly. 

To  convert  the  remaining  three  basic  constructs  (as  shown  in  Table  2.1)  into  the  A  form, 
we  first  examine  the  construct  Am  In-  If  performed  naively,  for  each  iteration  of  the  tensor 
product  loop,  we  end  up  reading  m  packets  of  size  1  (complex  element)  each: 

Arr^^  LZ^{In^^Am)LT 

stream(i//) 


The  following  dataflow  graph  example  shows  two  reads  and  two  writes  of  size  1  each  to  com¬ 
pute  iteration  2  (highlighted  in  red)  of  ¥2®! a,  as  seen  in  the  formula  below: 

stream(i//) 


To  achieve  larger  packet  sizes,  as  done  in  the  parallel  paradigm,  we  process  iterations  to¬ 
gether,  where  ^^f  is  the  desired  read/write  packet  size  (note  that  xf/  kernels  must  now  fit  in  the  local 
memory).  When  doing  so,  we  bring  in  m  single  elements  from  the  input  vector  which  are  spaced 
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at  a  stride  of  n\ 

Arn^  -  {LZ'  V)  (/.  ® ^  ®  /^))  {Lf^  ® ^  1^)  (3.25) 

stream(i/A) 

where  =  nis.  The  leftmost  and  rightmost  factors  above  are  permutations  that  gather  and  scat¬ 
ter  data  from/ to  memory  in  packets  of  size  y/.  Our  previous  example  now  contains  two  reads 
and  two  writes  of  size  2  each,  to  compute  iteration  1  (which  now  computes  two  F2  hutterflies)  of 
F2  ®  ^4-  as  seen  in  the  formula  below: 


F2^  -  [L*  h)  [h  iF2  ®  h))  {L\  h) 

stream  (1//) 


y  ^ - X 


void  transform (double 

*Y,  double  *X)  { 

//  Header: 

i  =  0; 

f  or ( int  j  =0  ; 

j  <111 ; 

J+  +  ) 

DMA_GET (altBuf X . 

X [i*m] ,  u) ; 

BLQCK_0N_DMA  0  ; 

SWAP (localX  , 

altBufX) ; 

f  or ( int  j  =0 ; 

j  <111 ; 

J  +  +  ) 

DMA_GET (altBuf X . 

X[(i+l)*m],  u); 

AClocalY  [0]  . 

localX  [0]  )  ; 

BLQCK_0N_DMA  0  ; 
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//  Main  body 
for(i=l;  i<s-2;  i++)  { 

//  Start  non-blocking  write  of  •previous  iteration’s  data 
SWAPdocalY,  altBufY); 
forCint  j=0;  j<m;  j++) 

DMA_PUT (altBufY ,  Y[(i-l)*m],  u) ; 

//  Start  non-blocking  read  of  next  iteration ’ s  data 
SWAPdocalX,  altBufX); 
fordnt  j=0;  j<m;  j+  +  ) 

DMA_GET  (altBufX  ,  X[d  +  l)*m],  u)  ; 

//  Compute  current  iteration 
AClocalYCO],  localXLO]); 

BLOCK_ON_DMA () ; 

} 

//  Footer:  (put  s-2,  compute  s-1,  put  s-1) 

SWAPdocalY,  altBufY); 
fordnt  j=0;  j<m;  j+  +  ) 

DMA_PUT (altBufY .  Y[(s-2)*m].  u) ; 

SWAPdocalX.  altBufX); 

AdocalYLO],  localX[0]); 

BL0CK_0N_DMA  () ; 

SWAPdocalY,  altBufY); 
fordnt  j=0;  j  <m ;  j+  +  ) 

DMA_PUT (altBufY .  Y[(s-l)*m],  u) ; 

BLQCK_ON_DMA () ; 


The  other  two  remaining  basic  SPL  constructs  are  streamed  similarly,  using  the  technique  of 
combining  multiple  kernels  to  increase  packet  size. 

Interaction  with  parallelization  and  vectorization.  Streamed  algorithms  may  also  have  to 
be  parallelized  and/or  vectorized.  Streaming  combines  well  with  the  parallelization  paradigm,  as 
seen  by  the  similarity  of  the  rewrite  rules  (in  Table  3.1  and  Table  3.2) .  Performing  both  paradigms 
on  an  SPL  construct  involves  splitting  loops  between  streaming  iterations  and  parallel  iterations. 
The  splitting  order  determines  the  type  of  code  generated,  and  is  discussed  in  detail  in  Chap¬ 
ter  4.  Streaming  also  combines  well  with  the  vectorization  paradigm:  vectorization  and  stream¬ 
ing  transform  the  construct  in  an  orthogonal  manner,  with  the  streaming  tag  always  being  the 
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outer  one: 


Is  Inis 

vec(v) 

" - " 

vec(v) 

The  standard  vectorization  rules  apply  to  remove  the  vectorization  tag  on  the  right  hand  side  of 
both  the  above  rules.  In  the  first  case,  unavoidable  permutations  are  created  to  perform  vector¬ 
ization,  while  the  second  case  is  already  a  naturally  vectorized  construct  as  long  asy/>v,  (where 
y/  is  the  streaming  packet  size,  and  v  is  the  SIMD  vector  length),  which  typically  is  the  case. 

3.4.4  Summary 

In  summary,  our  streaming  paradigm  optimizes  the  structure  of  tensor  products  for  streamed 
execution,  with  data  being  read  and  written  with  a  minimum  specified  packet  size.  Our  stream 
tag  is  parameterized  by  stream (i)/) ,  where  y/  is  the  minimum  packet  size  used  by  loads  and  stores. 

We  rewrite  all  SPL  constructs  into  the  two  basic  mappable  constructs  Is  <&=  Am  (streamed 
loop,  interpreted  accordingly  at  program  generation  time) ,  and  which  reads  or  writes 

data  with  a  packet  size  of  y/.  Our  rewrite  rules  are  summarized  in  Table  3.2. 


3.5  Chapter  Summary 

In  this  chapter,  we  presented  an  abstraction  of  our  target  architecture,  followed  by  details  of 
actual  platforms  used  to  evaluate  the  ideas  in  this  thesis.  We  then  identified  the  two  most  funda¬ 
mental  architectural  paradigms  that  we  optimize  for,  which  are  the  parallelism  and  the  stream¬ 
ing  paradigms.  Finally,  we  presented  SPE  constructs  that  naturally  map  well  to  each  of  the 
paradigms,  and  showed  how  to  rewrite  all  of  the  other  basic  SPL  constructs  into  these  natural 
paradigms  using  rewrite  rules.  In  the  next  chapter,  we  will  build  directly  upon  the  ideas  pre¬ 
sented  in  this  chapter  to  construct  algorithms  for  the  DFT  that  are  optimized  for  our  target  ar¬ 
chitectures. 
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■  A  barrier  B 

streain(i/A)  stream(i/^) 
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(3.26) 

(3.27) 

(3.28) 

(3.29) 

(3.30) 

(3.31) 

(3.32) 

(3.33) 

(3.34) 


Table  3.2;  Streaming  paradigm  rules.  P  is  any  permutation,  D,Dj  are  diagonal  matrices.  The  tag 
stream(i/^)  specifies  streaming  using  a  minimum  data  packet  size  of  i/a  for  loads  and  stores.  We  assume 
nlp>y/  for  all  the  above. 
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In  the  previous  chapter,  we  developed  rules  to  rewrite  SPL  expressions  to  fit  two  architectural 
paradigms  fundamental  to  this  thesis:  parallelization  and  streaming.  In  this  chapter,  we  use 
these  rewrite  rules  to  design  FFTs  that  are  tailored  to  our  target  platforms.  We  first  present  some 
usage  scenarios  to  show  the  types  of  code  we  want  to  generate.  Then,  we  discuss  FFT  code  opti¬ 
mized  for  latency,  followed  hy  FFT  code  optimized  for  throughput. 

Base  algorithms.  The  streaming  and  parallel  FFTs  designed  in  this  chapter  are  derived  from 
the  generalized  Cooley- Tukey  FFT  (2.12)  for  ID  DFTs,  and  on  the  row-column  algorithm  (2.15) 
for  2D  DFTs.  Both  are  reproduced  here  for  convenience. 

Cooley- Tukey  ID  DFT  breakdown  rule: 

DFT„„  —  (DFT„®/ot)D„,„{/„<8iDFT,„)L;^'”.  (4.1) 

Row-column  2D  DFT  breakdown  rule: 

DPT^xn  ^  DFTmi8>DFT„  ^  (DFT^  iSi/n)(/m  <8iDFT„).  (4.2) 

Problem  sizes  and  constraints.  In  this  thesis,  we  focus  on  two-power  sizes.  Since  these 
are  composite  sizes,  the  Cooley- Tukey  FFT  is  applicable.  We  also  assume  that  the  number  of 
PEs  divides  the  problem  size.  In  addition,  when  SIMD  vectorization  is  required,  we  assume 
that  the  problem  size  is  compatible  with  the  underlying  vectorization  technique  used.  Usually, 
this  means  that  the  size  is  a  multiple  of  the  square  of  the  SIMD  vector  length,  as  discussed  in 
[Franchetti  et  al.,  2006b]. 
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Figure  4.1:  Usage  scenarios:  (a)-(d)  have  to  be  optimized  for  latency  (notice  that  only  a  single  DFT  kernel 
is  performed  in  these  cases);  (e)-(g)  have  to  be  optimized  for  throughput  (multiple  DFTs  are  performed). 
The  box  with  the  arrow  around  the  PEs  denotes  operations  on  vectors  resident  in  the  memory  node. 


4.1  Usage  Scenarios 

When  designing  a  high-performance  implementation  of  the  DFT,  application  demands  have  to 
he  considered.  In  this  section,  we  classify  the  typical  usage  scenarios  to  help  us  guide  our  algo¬ 
rithm  construction  in  the  rest  of  the  chapter. 

An  application  might  require  a  library  that  takes  into  account  the  following  options: 

•  Latency  versus  throughput  optimization:  An  application  might  require  either  a  latency- 
optimized  version  to  compute  a  single  DFT  that  is  in  the  critical  path  of  operations  to  he 
performed  on  the  input  data,  or  might  require  a  throughput-optimized  version  to  compute 
a  large  number  of  DFTs  on  a  stream  of  input  data. 

•  Input  and  output  data  resident  in  local  memories  versus  data  resident  in  main  memory: 
This  distinction  typically  exists  in  chip-based  instantiations  of  our  distributed  memory 
architecture  abstraction. 

•  A  specified  number  of  PEs  to  use  versus  determining  and  using  an  optimal  number  of  PEs: 
An  application  may  demand  a  specific  number  of  PEs  to  be  used  regardless  of  the  PEs 
available  on  the  platform. 

The  combination  of  such  requirements  and  other  specifications  such  as  the  size  or  data  for¬ 
mat  of  the  DFT  and  the  number  of  PEs  allocated  gives  rise  to  a  large  number  of  possible  usage 
scenarios  as  depicted  in  Figure  4.1.  A  small  latency- optimized  DFT  kernel  can  work  on  vectors 
resident  on  a  single  PE  as  shown  (a).  Medium  sizes  DFTs  must  be  parallelized,  and  can  work  on 
vectors  resident  either  across  multiple  PEs  or  stored  in  the  memory  node  as  shown  in  (c)  and 
(d).  Large  DFTs  must  be  streamed  in  and  out  of  the  PEs  and  computed  in  parts  as  shown  in 
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(b).  Throughput-optimized  versions  can  use  streaming  to  hide  memory  access  costs,  see  (e),  or 
execute  small  DFTs  independently  on  multiple  PEs  as  shown  in  (f),  or  stream  larger  DFTs  paral¬ 
lelized  across  multiple  PEs  as  shown  in  (g). 

In  Section  4.2,  we  construct  latency  optimized  FFTs  for  input  vectors  distributed  across  the 
memories  of  the  PEs.  In  Section  4.3,  we  consider  input  vectors  resident  on  the  memory  node. 
In  Section  4.4,  we  construct  throughput- optimized  distributed  memory  FFTs.  Finally,  in  Sec¬ 
tion  4.5,  we  show  how  the  specifications  for  the  various  usage  scenarios  shown  in  Figure  4.1  are 
implemented  in  our  program  generation  system. 


4.2  Latency  Optimized  Distributed  DFTs 

In  this  section,  we  parallelize  our  FFTs  for  distributed  memory  systems,  where  the  input  and 
output  vectors  for  the  DFT  are  distributed  across  the  PEs  on  which  they  will  be  performed. 

We  first  construct  a  parallelized  FFT  using  our  rewrite  rules  built  in  Chapter  3.  We  then  dis¬ 
cuss  three  ways  to  improve  the  performance  of  this  algorithm,  including  different  data  distribu¬ 
tions,  large  packet  sizes,  and  overlapped  communication. 

4.2.1  Basic  Parallelization 

We  use  rewrite  rules  from  Section  3.3  to  design  our  parallelized  ID  and  2D  FFTs.  We  assume  that 
the  input  data  is  distributed  across  p  PEs.  This  means  the  input  array  of  size  n  is  divided  into  p 
equal  chunks  of  size  nip  each  (we  assume  p  divides  n),  and  the  fth  PE  contains  the  Tth  chunk  of 
the  array  in  its  local  memory.  In  some  cases,  computing  using  a  smaller  or  larger  number  of  PEs 
than  the  number  of  PEs  the  array  is  distributed  on  may  be  beneficial.  Such  rescaling  techniques 
were  addressed  in  [Bonelli  et  al.,  2006],  but  are  not  considered  in  this  thesis. 

In  the  following  sections,  we  first  demonstrate  our  techniques  on  the  ID  DFT,  and  then  show 
how  they  can  also  be  applied  to  the  2D  DFT. 

Notation  for  twiddle  factors.  Since  twiddle  factors  in  the  DFT  only  introduce  a  multiplica¬ 
tion  by  a  constant,  they  do  not  affect  our  optimizations.  Hence,  we  use  the  following  abuse  of 
notation  for  convenience: 


DFT°  -  (DhTf^’’  ^ilm) 

—  (E)FT^  ®Ifyi)Dn,m 


Representation  of  context.  In  the  rest  of  this  chapter,  we  frequently  point  to  parts  of  an  SPL 
expression  to  show  what  they  mean.  To  aid  in  this  process,  we  provide  contextual  information  of 
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an  SPL  expression  by  using  braces  above  of  the  expression.  Braces  under  an  expression  always 
refer  to  tags.  For  example: 

context 


tagO 

Basic  parallelization.  We  begin  by  parallelizing  the  Cooley- Tukey  ID  DFT  shown  in  (4.1). 
We  tag  the  left  hand  side  of  (4.1)  to  specify  that  we  want  to  parallelize  it  across  p  PEs  with  data 
exchanges  performed  at  a  minimum  packet  size  of  p.  We  apply  (4.1)  and  use  (3.12)  to  distribute 
the  tags  among  the  SPL  constructs,  and  then  remove  the  tags  by  using  the  rules  from  Table  3.1. 


DFTotjz  (DFT;2  iS>/,jj)  Dfi  ffi  (7,2  iSi  DFTto)L,j 
par(p,^)  par(p,p)  par(p,p)  par(p,p) 

-  (DFT°„  »Im)  {In^'DFTm)LT' 

par(p,p)  par(p,p) 

comm  comp  comm 

-((L7S7^(7p®||  (DFT>/„2/p))(f7^f^ 

comp  comm 

[Up  ®||  Unip  ®  [L^p’^^Inlp])  (4.3) 

Parallel  stages.  The  algorithm  in  (4.3)  is  composed  of  alternating  stages  of  computation  and 
communication  (global  data  exchanges),  and  thus  has  the  form  of  bulk  synchronous  process¬ 
ing  parallel  programming  model  described  in  [Valiant,  1990].  The  hrst  step  in  (4.3),  which  is  the 
rightmost  {Lf^"^^Inip)  performs  an  all-to-all  communication  to  redistribute  data  amongst  the 
PEs.  (4.3)  then  performs  four  alternating  stages  of  computation  and  communication.  This  is 
visualized  in  the  final  line  of  (4.3),  where  the  overbraces  mark  the  stages  with  “comp”  for  com¬ 
putation  and  “comm”  for  communication. 

Observe  that  in  (4.3),  all  the  computation  stages  are  p-way  parallel,  as  described  in  Sec¬ 
tion  2.4.2,  and  the  communication  stages  exchange  data  packets  in  blocks  of  size  mip  (last  two) 
or  nip  (first  one),  as  explained  in  Section  3.3.2. 

Composing  parallelized  SPL  constructs.  When  multiple  parallel  SPL  constructs  are  com¬ 
posed,  the  following  issues  arise: 

•  Barriers.  As  (3.12)  shows,  barriers  are  required  between  constructs,  due  to  data  depen¬ 
dencies. 


•  Address  space.  We  use  the  following  rules  to  generate  code  for  (4.3)  that  uses  a  distributed 
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memory  address  space; 

1.  A  construct  of  the  form  IpiSi\\  A  is  performed  in  parallel  across  p  PEs,  with  A  being 
performed  in  the  local  address  space  of  each  PE. 

2.  a  DFT°  that  appears  as  a  part  of  A  in  constructs  of  the  form  Ip<&\\  A  must  he  parame¬ 
terized  hy  p  (not  shown),  so  as  to  use  the  appropriate  portion  of  the  twiddle  factors, 
as  explained  earlier.  The  entire  set  of  twiddle  factors  are  distributed  among  the  PEs. 

3.  A  construct  of  the  form  P^Ip  is  a  global  permutation,  and  thus  involves  an  all  to 
all  communication.  We  interpret  the  construct  to  read  and  write  to  a  virtual  global 
shared  address  space  that  is  partitioned  into  p  chunks  and  assigned  in  a  block  distri¬ 
bution  to  each  of  the  p  PEs. 

•  Combining  data  exchanges.  If  multiple  data  exchange  stages  appear  between  two  com¬ 
putation  stages,  they  are  optimally  implemented  by  first  combining  them  using  standard 
tensor  product  rules  into  a  single  communication  stage.  For  example,  assuming  A  and  B 
are  computation  stages,  P,Q  permutations,  and  p2^  P\' 


A-  (P®/;Lii)  ■  •  5  ^  A-  [{P  ®  Ip^/p^)Q^Ip^]-B 


(A,  B  are  compute  kernels,  P,  Q  are  stride  permutations)  combines  to  a  single  exchange 
with  the  minimum  packet  size  of  the  two. 

•  Pushing  versus  pulling  data  exchanges.  In  “push  model”  architectures,  a  single  commu¬ 
nication  stage  between  two  computation  stages  is  attached  to  the  earlier  stage  as  its  store 
operation,  while  in  a  “pull  model  architecture”,  it  is  attached  to  the  later  stage  as  its  load 
operation 

Increasing  performance.  To  further  increase  the  performance  of  algorithm  (4.3),  we  must 
eliminate  bottlenecks  in  each  of  the  computation,  and  each  of  the  communication  stages.  We 
use  previous  work  ([Franchetti  et  al.,  2006b])  to  produce  fast  vectorized  kernels  for  the  computa¬ 
tion  stages,  and  therefore  assume  these  are  optimized.  We  are  thus  left  with  the  task  of  optimiz¬ 
ing  the  communication  stages  in  this  thesis.  Communication  can  be  sped  up  by  three  possible 
factors: 

•  One  or  more  the  communication  stages  can  be  eliminated  by  using  an  initial  and  final  data 
distribution  optimal  for  performance.  This  is  covered  next  in  Section  4.2.2. 
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Figure  4.2:  Block  data  distribution  example:  4  PEs. 

•  On  most  platforms,  achieved  bandwidth  is  increased  when  optimal  packet  sizes  are  used. 
Typically,  this  means  using  large  packets,  which  amortized  packet  overhead  costs.  We  de¬ 
rive  an  FFT  that  cheaply  assembles  small  packets  to  form  large  packets  before  sending 
them.  This  is  covered  in  Section  4.2.3. 

•  Communication  costs  can  also  be  reduced  by  hiding  them  partially  or  fully  by  overlapping 
them  with  computation.  This  is  covered  in  Section  4.3  and  Section  4.4. 

4.2.2  Data  Distribution 

So  far,  we  have  assumed  that  the  input  and  output  vectors  follows  a  block  distribution.  An  ex¬ 
ample  of  a  block  distribution  for  four  PEs  is  shown  in  Figure  4.2. 

Applications  may  either  demand  different  data  distributions,  or  may  accept  other  data  dis¬ 
tributions  in  exchange  for  higher  performance.  We  can  modify  our  parallel  algorithm  to  handle 
other  data  distributions  simply  by  adding  permutations  to  to  the  beginning  and  to  the  end  as 
necessary,  and  using  loop  merging  techniques  from  [Franchetti  et  al.,  2005]  to  minimize  the  cost 
of  such  permutations.  Changing  the  data  layout  is  equivalent  to  composing  the  input  and  the 
output  of  an  SPL  formula  with  permutations.  Therefore,  instead  of  computing  the  DFT,  we  in¬ 
stead  compute: 

DFT^  =P“^  xDFTxQ  (4.4) 

Based  on  this,  we  note  that  when  we  choose  P  and  Q  appropriately,  we  can  effectively  cancel  out 
the  initial  and  final  permutations  (with  the  general  structure  LigiDFT)  in  (4.3),  and  thus  increasing 
performance  by  eliminating  two  all-to-all  data  exchanges.  Such  a  distribution  is  called  the  block- 
cyclic  data  distribution,  discussed  below. 

Block  cyclic  data  distribution.  In  this  type  of  distribution,  visualized  below,  a  vector  of  size 
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Number  of  PEs:  4 
Block  size:  N/8 


Figure  4.3:  Block-cyclic  data  distribution  example:  4  PEs,  block  size  of  AT/S.  This  means  that  the  vector  of 
size  N  is  divided  into  8  blocks  of  size  N/8  each,  and  each  block  is  assigned  in  a  cyclic  fashion  to  the  PEs. 

N  is  divided  into  fc-sized  blocks,  and  distributed  in  a  cyclic  fashion  among  p  PEs.  An  example 
with  a  block  size  of  N/8  distributed  in  a  block  cyclic  fashion  among  4  PEs  is  shown  in  Figure  4.3. 

To  use  this  data  distribution,  we  conjugate  our  EFT  with  ®  1^).  (Conjugating  A  with  a 
permutation  P  means  composing  A  with  the  permutation  and  its  inverse  to  achieve  P~^AP).  In 
the  special  case  where  Nlk  =  mp,  and  k=  ml  pin  (4.3),  this  results  in  the  first  and  last  commu¬ 
nication  stages  of  (4.3)  being  cancelled  out,  which  can  significantly  improve  performance: 


DFT^2  ^ 

par(p,p) 

iDFr^0l,nlpmLp^^Imlp)Up^\\  Uml  p  ^'DFT,n)LZ]'pWp"' p)f^'" 

-  ((/p®||  {P)Fr^^Imlp)){L^''^ImlpWp^\\  [Imlp^^^Tm)L'^,'l^)] 

Generalized  data  distributions.  The  same  general  idea  applies  when  we  want  to  use  other 
data  distributions,  which  can  be  integrated  into  our  framework.  We  now  consider  different  types 
of  data  distributions  based  on  stride  permutations.  Specifically,  assuming  L  is  a  stride  permuta¬ 
tion: 


•  (/p  <8i  Inip)  -  In  is  a  block  distribution  (default), 

•  P  is  a  cyclic  distribution, 

•  (P  i8>  Ip)  is  a  block-cyclic  distribution  of  block  size  p  (a  common  distribution), 

•  [Ip  <8i||  P)  is  a  locally-cyclic  distribution. 
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•  ilp  <8i||  (P  iS>  /))  is  a  locally-block-cyclic  distribution 

Using  the  above,  we  can  generate  code  for  various  data  distributions.  Matching  input  and 
output  distributions,  provided  by  a  conjugate  pair  of  permutations,  are  called  symmetric  distri¬ 
butions.  We  can  also  easily  generate  code  for  asymmetric  distributions.  Note  that  asymmetric 
distributions  can  also  include  the  same  type  of  distribution  for  both  the  input  and  output,  but 
with  different  parameters  (e.g.,  a  block-cyclic  distribution  with  different  block  sizes  for  the  input 
and  output) . 

To  summarize,  we  generate  code  for  various  data  distributions  by  applying  the  appropriate 
permutations,  and  using  loop  merging  (from  [Franchetti  et  al.,  2005])  to  minimize  the  cost  of 
the  permutations.  Further,  we  can  also  identify  distributions  that  will  enhance  performance  by 
eliminating  the  initial,  final,  or  both  communication  stages. 

4.2.3  Increasing  Data  Exchange  Packet  Size 

In  (4.3),  the  first  communication  state  uses  a  packet  size  of  n/p,  while  the  second  and  the  third 
use  a  size  of  ml  p.  Since  mn  =  N  where  N  is  the  problem  size  of  the  DFT,  for  a  given  p,  increasing 
m  results  in  a  decrease  of  n  and  vice  versa.  To  achieve  packet  sizes  that  are  approximately  equal 
in  all  the  communication  stages,  we  must  follow  a  square  root  decomposition  (exact  or  approx¬ 
imate)  of  N  into  m  and  n.  This  still  results  in  packet  sizes  for  the  ID  DFT  that  do  not  scale  very 
well  with  increasing  p,  as  shown  in  Figure  3.7. 

On  many  platforms,  the  overhead  of  sending  a  packet  is  very  high  until  a  minimum  thresh¬ 
old  size  is  reached,  as  shown  in  Figure  3.6.  In  this  section,  we  provide  an  alternate  parallelized 
algorithm  that  produces  larger  packets.  To  do  so,  we  assemble  all  packets  to  be  sent  from  one  PE 
to  another  into  one  large  packet,  send  it,  and  disassemble  it  on  the  receiving  end,  as  was  shown 
in  (3.5).  We  first  present  some  permutation  identities  required  to  build  our  algorithm,  and  then 
present  the  scalar  and  SIMD  vectorized  versions  of  the  large  packet  algorithm. 

Permutations  identities.  We  will  use  the  following  identities  in  our  large  packet  algorithm: 

For  matrices  A,B,C  and  a  stride  permutation  P, 


iABC)'^ 

=  c'^b'^a'^ 

(4.5) 

(Zl^B)"^ 

=  a'^  ^b'^ 

(4.6) 

(A®B)“^ 

=  A~^^B~'^ 

(4.7) 

P~^ 

-  pT 

(4.8) 
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={Ip  ®||  ^  Imnlp^)Up  ®ll  ip  ^  Imlp)  (4.9) 

jmn  _  ffTmn\-l\-l  _  ffTmn\T\T  _  rrmn-vT 
~  J  J  ~  )  J  ~  ^ 

=  [Up  ®||  L7/)(L^'  ®  Imnlp^nip  ®ll  ip  ®  Inlp)V 
=  {/p  ®||  LZ  ®  Inlpfi^'  ®  Imnlp^fUp  ®||  i^/)^ 

=  [Ip  ®||  (L™  )/p  ®  /(n)/p))(i^'  ®  hm)n)lp^)Up  ®ll  (4.10) 

Large  packet  algorithm.  We  begin  by  reproducing  tbe  basic  large  packet  algorithm  from 
[Bonelli  et  al.,  2006] .  As  in  Chapter  3,  we  use  the  tag  A  to  indicate  parallelization  of  A  using 

parBig(p) 

the  large  packet  algorithm; 


DFTmn  "*•  (DFT^  <S>/n)  Dfyi,n  DFTfi)Ljj^ 

parBig(p)  parBig(p)  parBig(p) 


(In 


parBig(p) 

)DFT„)  LZ"  D 


m,n 


(/„®DFT„)  L’-l 


parBig(p)  parBig(p)  parBig(p)  parBig(p)  parBig(p)  parBig(p) 
local  perm  local  perm 


®ll  ®  him)n)lp^)  Up  ®||  (L;  ^  kmVp)) 


comp 


[Ip  ®||  /„/p  iSiDFT° 


local  perm  comm  local  perm 

®ll  ®  hn)m)lp^)[lp^\\  [LZ^knVp)) 

comp 

{Ip  iSi||  Imlp  ®  DFT^) 


local  perm 


®ll  ®  hm)n)lp^)  [Ip  ®||  [L;  ^  hmVp)) 


((m)  n)lp 


local  perm 


In  the  above,  in  addition  to  marking  the  communication  and  computation  parts,  we  mark 
expressions  with  “local  perm”  to  indicate  that  an  SPL  expression  is  a  local  permutation,  which 
means  each  PE  locally  permutes  its  own  data. 

Rearranging,  and  using  (3.11)  for  the  first  factor,  we  get: 
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’■[Ip  iS)||  [L^ffiyp  Iln)lp))  [Lp  Imnlp'^') 

comm 

[Ip  ®||  ®||  Imp  ®  DFr„)(/p  ®||  [4"  ®  I^„ip2) 


(7p®ll  (L™®/(„)/p))(7p®||  7„/p^DFT„)(7p®||L;2"’'P){LP  ®7„„/p2) 

[Ip  ®||  [Lp  ®  I[m)lp)) 
local  perm 


[Ip  1^11  [L(yfi)ip  ®  I(n]lp))  [Lp  iSi  Imnip2) 
comp 


[Ip  '^11  (DFT^  ®I„/p))  [Lp  iS>  Ifnnlp^^ 

comp  comm 

' - - - - 2 - - ' 

(7p®ll  {(L“®7(„)/p)(7„/p®DFT„)(Lj2p'^)H'f'P  ^Imnip^) 

local  perm 


[Ip  '^11  [Lp  <8>  I{m)lp)) 


(4.11) 


(4.12) 


Notice  in  the  above  that  we  have  now  merged  in  the  local  permutations  with  adjoining  com¬ 
putation  stages  whenever  possible.  We  use  loop  merging  techniques  from  [Franchetti  et  al.,  2005] 
to  accomplish  this  merging  in  practice.  Local  permutations  that  do  not  have  an  adjoining  com¬ 
putation  block  (the  first  and  the  last  local  permutations)  must  be  performed  separately.  Also 
notice  that  during  each  communication  stage,  each  PE  sends  one  single  packet  of  size  mn/ to 
every  other  PE,  as  explained  in  Chapter  3. 


Vectorized  large  packet  algorithm.  Below,  we  derive  a  v-way  SIMD  vectorized  variant  of 
(4.12): 
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DFT^n  (DFTm  <8i/n)  Dfyifi  DFTn]L^‘ 


parBig(p,v)  parBig(p,v)  parBig(p,v)  parBig(p,v) 

-  ^  h)  {iniv  ®  DFr„  ^ly)  iLZZJ’'^  iv)  Umiv  ^  Uv  ^  dft„)l7)  ®  /.) 

parBig(p,v)  parBig(p,v)  parBig(p,v)  parBig(p,v)  parBig(p,v) 

-  Up  ®||  ®  ®  ®  IvKIp  ®||  (Fp  ®  Inivlp  ®  /.)) 


vec(v) 


Up  iSi||  Inipv  F)FT^  igily) 

^  ^  ^ 

vec(v) 

®ll  ®  ®  hm)nlvlp-  ®  IvKIp  ®||  ®  Imlp  ®  /.)) 


vec(v) 


Up  ®  II  /m/pi.  ®  Ui.  «>  DFT„)L7) 


vec(v) 


{Ip  ®||  (1“;’'"'''^'^  ®  ®  /(„)™/Wp2  ®  IvKIp  ®||  Inip  ®  Iv)) 


local  perm 


vec(v) 

comm 


•  Up  ®||  ®  ® 


vec(v) 


comp 


Up  ®||  UZ  ®  I(niv)ip  ^  Iv))Up  ®||  4/p.  ®  DFT°„  ^Iy){Ip  ®||  (i[Sp ®  4)) 


vec(v) 


{Lp  ^Imn/p^^ 


comp 


Up  ^11  UZ^  ®  I(m)lp  ®  Iv))Up  ®||  Imipv  ®  (4  ®  DFT„)L7)(7p  ®||  U^Ziirp''^'^  ® 

" - " 

vec(v) 


{Lp 

local  perm 


(/p^ll  UZ''’  ®kn)lp®Iv)) 
vec(v) 


This  is  a  SIMD  vectorizable  version  of  (4.12)  based  on  our  rules  developed  and  presented  in 
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Table  3.1.  This  can  be  seen  by  all  SPL  expressions  above  (except  the  second  computation  stage, 
which  must  be  vectorized  as  is)  having  the  general  form  A®  ly,  which  is  naturally  vectorizable 
as  discussed  in  Section  2.4.2.  SIMD  vectorization  must  be  taken  into  account  when  parallelizing 
our  algorithm,  if  SIMD  vectorization  is  performed  separately  at  a  later  stage,  additional  unneces¬ 
sary  permutations  maybe  introduced,  which  would  affect  performance.  Note  that  our  hnal  algo¬ 
rithm  above  includes  a  vector  tag  that  is  removed  using  rules  from  previous  work  in  [Franchetti 
etal.,  2006b]. 


4.2.4  2D  DFTs 


So  far,  we  have  looked  at  parallelization  of  ID  FFTs.  Parallelization  of  2D  FFTs  is  similar,  except 
for  the  two  following  differences.  First,  we  use  the  row-column  algorithm  (4.2)  as  the  basic  algo¬ 
rithm.  Second,  because  of  the  structure  of  the  row  column  algorithm,  we  only  have  two  global 
permutations,  which  means  we  only  have  two  global  exchanges  of  data  as  opposed  to  three  in 
the  ID  case.  We  can  use  either  our  basic  parallelization  algorithm  presented  in  Section  4.2.1 
or  use  our  large  packet  algorithm  presented  in  Section  4.2.3  to  perform  our  parallelization.  We 
show  each  of  these  helow. 

Basic  parallelization  of  the  2D  DFT: 

DFTmx«  ^  (DFTyn  ®In)  {Im  DFT,;) 

par(p,^)  par(p,p)  par(p,p) 

comm  comp  comm  comp 

^^mlp)  (fp  ®  II  (DFT„  ^Imlp))  ((Fp  ®||  (Fn/p  ®  DFT,^)] 


Large  packet  parallelization: 
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DFXmxn  (DFTm  ^  DFTn) 

parBig(p)  parBig(p)  parBig(p) 

^  (7„^DFT^)  (/^^DFTJ 

parBig(p)  parBig(p)  parBig(p)  parBig(p) 

local  perm  comm  local  perm 

^  (-fp  ®ll  ®  hm)n)lp^nip^\\ 

comm 

[Ip  II  Ifi! p  ®  DFT;72) 

local  perm  comm  local  perm 

{Ip  ®||  ®  Immyp^nip^W  ilf^Iinyp)) 

comm 

[Ip  iSi||  Imlp  ®  DFT^) 


Large  packet  vectorization  and  overlapped  communication  apply  in  a  straightforward  man¬ 
ner  to  2D  DFTs,  and  we  do  not  discuss  those  here  further. 

4.2.5  Communication  Bound  on  Distributed  DFTs 

Since  performing  a  distributed  FFT  involves  a  significant  amount  of  inter-PE  communication, 
the  interconnect  topology  and  characteristics  will  have  a  major  impact  on  performance.  In  this 
section,  we  present  a  model  to  evaluate  and  hound  the  performance  of  distributed  parallel  ID 
and  2D  FFTs. 

We  introduce  the  following  notation: 

•  B  =  Bisection  bandwidth  (effective  bandwidth  we  get  when  doing  an  all-to-all  communi¬ 
cation  across  all  nodes),  measured  in  GB/s. 

•  C  =  Number  of  bytes  per  complex  element. 

•  N  =  Problem  size  of  the  DFT. 

•  S  =  Number  of  communication  stages  (depends  on  the  DFT  type  and  data  distribution;  ID 
DFTs  with  block  distribution  involve  3  stages  ((4.19)),  and  with  block  cyclic  distribution 
with  the  appropriate  block  size  involve  1  stage  (discussed  in  Section  4.2.2).  2D  DFTs  with 
block  distribution  involve  2  stages  ((4.25)). 


82 


Chapter  4.  Designing  Platform-Optimized  FFTs 


•  Tc  =  Runtime  of  all  computation  stages  involved.  Essentially,  the  time  taken  by  the  entire 
DFT  computation  without  the  communication  cost. 

•  Tfn  =  Time  taken  by  the  communication  of  each  stage  (assuming  all  stages  take  the  same 
time). 

•  H  =  Overlap  factor,  defined  thus;  if  we  can  overlap  by  H%,  this  means  H%  of  the  minimum 
time  of  the  computation  and  communication  can  be  hidden,  and  thus  eliminated  from 
runtime. 

Performance  bound.  We  first  find  a  minimum  bound  on  time,  and  use  that  to  find  a  maxi¬ 
mum  bound  on  performance: 


Total  runtime  =  T^  +  iS^  Tm) 

Total  bytes  transferred 
Bandwidth 


Total  bytes  transferred  =  Nx  C 


Tm  — 


NC 


We  use  5N\ogN  as  the  number  of  operations  in  a  DFT  of  size  N^.  Based  on  this  number,  the 
performance  is  given  by: 


Total  number  of  operations 

Performance  = - 

Total  runtime 


SWlogAT 


Tc  +  S  ■  Tm 

^  5N\ogN 

-  rr  ,  SNC 
Jc  +  B 


Whis  is  an  approximate  and  slight  overestimate  of  the  exact  operations  count  which  depends  on  the  exact  FFT 
used,  as  presented  in  [Frigo  and  Johnson,  2005]. 
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We  can  use  the  above  to  estimate  the  performance  of  our  DFTs,  and  as  a  reference  to  ensure 
that  our  performance  is  as  expected. 

4.2.6  Summary 

In  this  section,  we  presented  algorithms  for  latency  optimized  DFTs  where  the  input  and  output 
data  are  distributed  across  the  PEs.  We  summarize  the  algorithms,  paradigms,  and  tags  pre¬ 
sented  in  this  section: 

•  Algorithms  used:  Cooley- Tukey  FFT  (for  ID  DFT),  row-column  (for  2D  DFT). 

•  Algorithms  huilt:  Basic  (small-packet)  ID  and  2D  FFT,  large  packet  and  vectorized  ID  and 
2D  FFT. 

•  Paradigms  used:  Parallel  paradigm. 

•  Tags  used:  A  tag  to  parallelize  using  the  basic  (small-packet)  parallelization  algo- 

par(p,^) 

rithm,  and  A  to  parallelize  using  our  large-packet  algorithm. 

parBig(p) 

4.3  Latency  Optimized  DFTs  Streamed  from  Memory 

Our  distributed  FFTs  were  based  on  the  assumption  that  the  input  and  output  data  for  the  DFT 
fits  into  the  combined  local  memories  of  the  PEs.  When  the  target  platform  also  contains  a  mem¬ 
ory  node,  we  can  also  compute  larger  size  DFTs  that  only  fit  on  the  memory  node.  An  example 
for  this  case  is  the  Cell  processor  where  we  may  want  to  compute  DFTs  that  do  not  fit  on-chip, 
but  only  in  main  memory.  In  this  case,  we  must  compute  the  DFT  by  working  in  steps  on  smaller 
chunks  of  the  problem  that  fit  across  local  memories  of  the  PEs.  Strategies  for  doing  so  while  still 
achieving  high  performance  are  not  trivial,  for  a  few  reasons.  First,  the  FFT  intrinsically  requires 
multiple  passes  through  the  entire  data  for  computation.  This  means  algorithms  that  reduce  the 
number  of  passes  and  make  most  use  of  data  locality  are  required.  Second,  most  memory  sys¬ 
tems  require  data  access  in  chunks  to  achieve  high  bandwidth.  Hence,  algorithms  must  satisfy 
this  constraint  too.  Third,  algorithms  should  minimize  the  costs  of  shipping  data  back  and  forth 
(called  streaming)  between  the  PEs  and  main  memory  by  overlapping  them  with  computation. 
We  address  these  challenges  in  this  section. 

We  first  show  how  to  perform  DFTs  streamed  from  the  storage  node  to  a  single  PE,  followed 
by  strategies  to  use  multiple  PEs  (all  connected  to  a  shared  storage  node).  Finally,  we  present  a 
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model  for  estimating  performance  limits  on  systems  where  performance  is  limited  by  memory 
bandwidth.  Throughout  this  section,  we  use  the  Cell  processor  as  an  example  target  architec¬ 
ture  to  illustrate  architectural  features  and  tradeoffs  that  we  can  expect  to  find  in  other  similar 
distributed  memory  processors. 

4.3.1  Streamed  FFTs 

We  discuss  how  to  factorize  the  DFT  so  we  can  compute  large  DFTs  using  our  streaming  tech¬ 
nique  of  interpreting  tensor  products  as  streamed  loops  as  presented  in  Chapter  3.  Since  we  tar¬ 
get  sizes  where  the  entire  input  vector  does  not  fit  into  the  combined  PEs’  memory  (henceforth 
called  “local  memory’’),  we  must  bring  the  vector  in  parts  from  the  storage  node  (henceforth 
called  “main  memory”)  to  the  local  memory.  However,  because  every  every  point  on  the  DFT’s 
output  vector  depends  on  every  point  on  the  input  vector,  at  least  one  intermediate  write  and 
read  from  main  memory  are  required.  We  define  a  stage  as  a  load  of  the  entire  input  vector  (in 
parts)  from  main  memory  into  local  memory,  followed  by  computation,  followed  by  a  store  of 
the  entire  output  vector  to  main  memory. 

The  Cooley-Tukey  (4.1)  and  the  row-column  (4.2)  algorithm  can  both  be  easily  streamed  in 
two  stages,  since  they  consist  of  two  tensor  products,  each  of  which  is  a  basic  SPL  construct  that 
can  be  streamed.  We  discuss  two-stage  streaming  first,  and  then  motivate  the  need  for  using 
additional  stages. 

4.3.1. 1  ID  DFT:  Streaming  in  Two  Stages 

As  mentioned  above,  our  approach  of  streaming  the  ID  DFT  in  two  stages  is  straightforward, 
and  consists  of  streaming  each  of  the  factors  of  the  Cooley-Tukey  algorithm.  We  show  ID  DFT 
streaming  in  two  stages  below: 

DFT^„  ^  (DFr^  ^In)ilm  ®  DFT  J 
stream(i/')  stream(i^)  stream(i/')  stream(i//) 

DFr„  ®Inls){LT^Inls)  Ut  ®  DFT„) (/; 

stage  2  stage  1 

^'[LZ'MnlsWs  DFr„  0ln,s)[LZ'®Inls)  Uml t  0lm, t). 

(4.13) 

We  visualize  two-stage  streaming  in  Figure  4.4.  In  the  first  stage,  data  is  read  at  a  stride  from 
the  input  vector  in  parts,  and  written  to  a  temporary  intermediate  vector  at  a  unit  stride.  In  the 
second  stage,  data  is  read  at  a  stride  from  this  intermediate  vector  in  parts  and  written  to  the 
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I _ I _ M  ^ _ r 

stage  2  Stage  1 


Figure  4.4:  ID  DFT  streamed  from  memory  in  two  stages. 


final  output  vector  at  a  stride. 

Packet  size  constraints.  (4.13)  has  to  satisfy: 

mlt>y/  (4.14) 

we  assume  y/  is  the  minimum  packet  size  used  for  loads  and  stores  to  main  memory  required 
by  our  streaming  paradigm. 

Note  that  we  could  have  chosen  different  packet  sizes  for  each  of  the  factors.  For  instance, 
because  first  stage  writes  back  data  to  main  memory  at  very  large  block  sizes  and  thus  has  a 
low  cost,  we  could  have  allowed  it  to  read  its  data  at  even  smaller  packet  sizes.  For  the  Cell,  the 
performance  advantage  gained  by  doing  so  is  minimal  (empirically  determined),  which  means 
we  use  the  same  packet  size  constraints  for  all  SPL  constructs.  This  may  not  hold  true  for  other 
platforms. 

Kernel  size  constraints.  The  maximum  problem  size  for  which  the  DFT  can  be  streamed 
using  two  stages  is  dictated  by  the  maximum  size  of  the  kernel  that  will  fit  into  the  PE’s  local 
memory.  Each  of  the  chunks  streamed  in  must  fit  into  the  local  memory.  If  C  is  the  maximum 
DFT  kernel  size  that  will  fit  in  local  memory,  this  means: 

mnls<C,  mnlt<C  (4.15) 

in  (4.13).  These  constraints  ensure  that  the  first  and  second  factor  in  (4.13),  respectively,  fit  into 
local  memory. 

Streaming  loop  constraints.  As  discussed  in  Section  3.4.2,  we  need  at  least  8  iterations  to  get 
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reasonable  overlap  in  practice,  which  yields  the  following  additional  constraints: 

s>8,  f>8  (4.16) 


Constraints  on  the  DFT  input  size.  The  packet  size,  kernel  size,  and  streaming  loop  con¬ 
straints  impose  lower  and  upper  bounds  on  DFT  problem  sizes  that  can  be  performed  using  the 
two  stage  streaming  algorithm.  We  derive  these  bound  below. 

Lower  bound  on  N,  the  DFT  size,  using  (4.14),  (4.15),  and  (4.16): 


N  =  mn 

>  y/txlrs  =  xf/^st 
N  >  64i//^ 


Upper  bound  on  N: 


s,t> 

n  m 
y/  y/ 


nm 


-y/  > 


N  y 
—y/  > 
/  ^ 


N 

C 

N 

C 


^N< 


1//2 


(4.17) 


(4.18) 


Lemma  1  The  lower  and  upper  bounds  of  the  two  stage  FFT  (4.13)  on  input  size  N  are  given  by: 

Q^y/^  <N<C^ly/^ 


To  obtain  bounds  on  the  input  size  for  the  Cell,  we  substitute  C  and  yr  for  the  Cell  into 
Lemma  1.  For  the  Cell,  yr  must  be  128  bytes,  or  16  single  precision  complex  elements,  as  demon¬ 
strated  in  Figure  3.6.  The  SPE’s  local  memory  is  256kB  (2^^  bytes),  which  can  fit  4  single  precision 
complex  element  vectors  (for  input,  output,  twiddles,  double-buffer)  of  size  2^^  each.  Since  we 
also  need  space  for  code,  the  largest  two -power  DFT  problem  size  we  can  perform  using  one  SPE 
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and  the  two -stage  streaming  algorithm  is  C  =  2^^.  Using  these  values  in  Lemma  1,  we  obtain: 

2^^  <  Ar<2^'^ 

As  we  can  see,  the  two-stage  algorithm  in  this  case  is  useful  for  only  a  very  narrow  range  of  DPT 
problem  sizes. 

The  need  for  additional  stages.  As  shown  above,  the  combination  of  packet  size  and  kernel 
size  constraints,  along  with  the  number  of  streaming  loop  iterations  required  limits  the  problem 
size  of  the  DFTs  that  can  me  streamed  using  the  two-stage  FFT  in  (4.13). 

We  solve  this  problem  by  further  breaking  down  and  streaming  the  smaller  DFTs  in  the  Cooley- 
Tukey  FFT,  which  allow  streaming  larger  DFTs  at  the  cost  of  additional  stages.  On  platforms 
where  the  achieved  memory  bandwidth  has  a  linear  relationship  with  packet  size,  there  may 
be  a  degree  of  freedom  in  the  number  of  stages  that  can  be  used  for  certain  DFT  problem  size 
ranges.  The  number  of  stages  used  must  thus  be  optimized:  using  too  few  stages  may  result  poor 
bandwidth  due  to  small  packet  sizes,  while  too  many  stages  may  result  in  poor  performance  due 
to  the  overhead  of  each  stage. 

Next,  we  show  an  algorithm  for  streaming  the  DFT  using  three  stages. 


4.3.1.2  ID  DFT:  Streaming  in  3-stages  Using  Vector  Recursion 

We  can  factorize  the  CT-FFT  into  3  stages  using  vector  recursion,  as  done  in  [Frigo  and  Johnson, 
2005].  We  show  3-stage  vector  recursion,  followed  by  how  it  can  be  used  with  streaming: 


DFTj;^„  — >•  (DFTj;  <S)  DFT„jn)Lj. 

-  (DFTfc  ^Imn)Dk,mn  [h  ®  (DFT^^  Um  ®  DFTJL™”) 

-  (DFTfc  0lmn)Dk,mn  [h  ®  DFT,„  ^In)Drrr,n)  Uk  ^  {im  ®  DFTJL™") 

-  (DFT°  [Ik  ®  DFr„  0ln))  Uk  ®  [Im  ®  DFTJL™") 

-  (DFT°  [Ik  ^  DFr„  ®7„)  ^  7„)  [l^  ®  Uk  ®  DFT„)L;^")  (L™”  ^  Ik) 


Using  vector  recursion  above,  we  can  now  stream  in  three  stages: 
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1  Write 

Compute 

Read  Write 

Compute 

Read  Write 

Compute 

Read  I 

1 

n 
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Stage  2 

! 

Stage  1 

Figure  4.5:  ID  DFT  streamed  from  memory  in  three  stages. 


stage  3 


stage  2 


stage  1 


DFT^  -  (DFT;  ®  /„)  (/„  ®  (4  ®  DFT„)Lf  J  (L™”  ®  h) 

stream(i/A) 


stream  (i/a) 


stream(i/A) 

stage  3 


stream{i/A) 


stage  2 


{Lf^Imnlb)[h  (DFXfc  ^Imnlb))il^b  ^Imnlb)  4  (DFT„  ®4) 

stage  1 


[im  (4  ®  DFT„)I^")  (L”"<»4)  (4.19) 
Similar  to  two-stage  streaming,  we  visualize  three-stage  streaming  in  Figure  4.5. 


Constraints.  Again,  we  show  constraints  for  each  stage  based  on  the  maximum  kernel  size 
that  will  fit  into  local  memory  (shown  using  C)  and  packet  sizes  used  for  moving  data,  and 
streaming  loop  requirements; 


Constraints  for  stage  1  in  (4.19): 

(Kernel  size  restriction) 

(Minimum  packet  size  restriction) 

(Minimum  streaming  iterations)  (4.20) 


kn<C 

n>y/,  k>yf 
m>8 
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Constraints  for  stage  2  in  (4.19): 

mn<C  (Kernel  size  restriction) 

mn>y/  (Minimum  packet  size  restriction) 

k>8  (Minimum  streaming  iterations)  (4.21) 

Constraints  for  stage  3  in  (4.19): 

kmni  b<C  (Kernel  size  restriction) 

mn/ b>y/  (Minimum  packet  size  restriction) 

b>8  (Minimum  streaming  iterations)  (4.22) 


Constraints  on  the  DFT  input  size.  Again,  we  derive  lower  and  upper  bounds  on  the  DFT 
problem  size  N  based  on  the  packet  size,  kernel  size,  and  the  streaming  loop  constraints  for  the 
three  stage  algorithm  (4.19)  based  on  (4.20),  (4.21),  and  (4.22): 

Lower  bound  on  N,  the  DFT  size: 

N  =  kmn 
N  N  N 
b’  k’ m  ^ 
mn  >  8y/ 

k>y/ 

^N>8y/^  (4.23) 

Upper  bound  on  N: 

U  1  ^ 

b,k,m>  — 

C 

mn 

- >  b 

xp 

mnkm 

^ - >  — 
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Lemma  2  The  lower  and  upper  bounds  of  the  three  stage  FFT  (4.19)  on  input  size  N  are  given  by: 


8y/^  <N<^ 


To  obtain  bounds  on  the  input  size  for  the  Cell,  we  substitute  C  and  if/  for  the  Cell  into 
Lemma  2: 

2^^  <N<2^^ 

As  we  can  see,  the  three-stage  FFT  {(4.19))  can  stream  a  wider  range  of  input  sizes  than  the  two- 
stage  version  ((4.13)). 

4.3. 1.3  2D  DFT  Streaming 

We  can  use  the  Row  Column  algorithm  in  (4.2)  to  stream  2D  DFTs  in  two  and  three  stages.  Two- 
stage  streaming  is  straightforward: 


DFTfc^xnr  j  OFT^r  j  (4.25) 

stream(i/A)  stream(i/A)  stream(i/A) 

A  similar  set  of  constraints  can  be  constructed,  as  we  did  the  for  ID  case.  Assuming  con¬ 
straints  are  met,  the  factors  above  can  be  streamed  using  standard  streaming  rules.  For  larger 
sizes,  when  kmy/  does  not  fit  into  the  local  memories,  we  can  further  factorize  the  left  factor  by 
applying  the  Cooley- Tukey  breakdown  to  the  DFT  inside  it: 


^^T^kmxnr  \jkm  ^  DFT^,-  j 

stream  (i//)  stream  stream  (i//) 

-  (  ((DFT;  ^ImWk  ®  DFT„)i;^"‘)  ®  Inr)(lkm  ®  DFT„,  ) 

stream(i/A)  stream(i//) 

-  (dFT°  [Uk  ®  DFT;„  ®  In)]  (hm  ®  DFT„,  ) 

stream(i/A)  stream(i/')  stream(i//) 


Now,  the  constraints  are  more  relaxed,  and  2D  DFTs  up  to  a  size  where  kif/  and  mi//  fit  on  the 
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PEs  can  be  streamed.  For  even  larger  sizes,  or  rectangular  sizes  where  km^f/  fits,  but  nr  does  not 
fit,  we  can  similarly  split  the  DFT  on  the  right  factor: 

|dFTj;^  \ljcm  ®  j 

stream(i/A)  stream(i//) 

(oFIfc^  ^I„r]  [ikm  ®  ((DFT°  ^Ir)  Un  ®  DFT,)!”'')) 

stream(i/A)  streamCi^) 

(DFTfc„  0lnr]  (hm  ®  DFT°  ®/,))  [ikm  ®  ((/„  ®  DFT,)!^''))] 

stream(i/A)  streamCi^)  streamCi^) 

-  (DFTfc„  0lnr]  (hm  ®  DFT°  ®/,))  [ikm  ®  Un  ^  DFT,)!”'') 

^  ^  ^  ^  ^  ^  ^  ^  ^ 

stream(i/A)  stream(i/A)  stream(i//) 

Again,  the  constraints  are  relaxed,  and  nr  does  not  have  to  fit  on  the  PEs.  When  needed,  we 
can  perform  both  the  above,  which  yields  a  4-stage  2D  DFT: 

(DPr^  ®/„„)  ((4  ®  DFT;,,  ®4,)(!fc'"  ®  I„)] 

stream  (i//)  stream  (i/a) 

[ikm  ®  DFT;  ® /,))  [ikm  ®  Un  ®  DFT,)!”'') 

stream  (i/a)  stream  (i/a) 


DFTfc 

mxnr 

stream(i/A) 


stream  (i/a) 


4.3.2  Combining  Streaming  with  Parallelism 

So  far,  our  streaming  techniques  were  restricted  to  using  a  single  PE.  In  this  section,  we  show 
how  to  combine  our  streaming  techniques  with  our  parallelization  techniques  for  distributed 
memory,  so  that  we  can  stream  using  multiple  PEs. 

Each  of  our  basic  SPL  constructs  consists  of  a  loop  over  a  kernel.  In  essence,  when  combining 
streaming  with  parallelism,  we  must  redistribute  the  iterations  of  this  loop  to  either  different 
PEs  (parallelization),  or  to  streamed  loops.  The  order  in  which  we  redistribute  loop  iterations 
gives  rise  to  two  methods  of  combining  these  paradigms.  We  present  two  methods  here,  called 
ParStreamCore,  and  StreamParChip. 
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First,  we  note  that  when  performing  both  streaming  and  parallelism,  the  rewrite  rules  for 
each  must  have  knowledge  of  the  other  paradigm  being  applied.  Thus,  the  rewrite  rules  for  ap¬ 
plying  each  of  these  paradigms  are  not  perfectly  orthogonal.  An  example  demonstrates  this  is¬ 
sue: 


In  ® 

par^p,^l) 


stream(i/') 


pai(p,p.) 


In  the  case  above,  if  A  is  not  parallelizable,  then  parallelization  fails.  If  A  is  parallelizable, 
we  might  still  not  achieve  maximum  performance.  This  is  because  the  streaming  rewrite  rule 
“consumed”  all  available  iterations,  leaving  none  for  the  parallelization  rules  to  act  on.  On  the 
other  hand,  if  the  streaming  rule  was  aware  that  parallelization  across  p  PEs  would  be  performed 
on  its  result,  the  process  would  have  proceeded  thus: 


In  ®  Am  Inip  ®=Ip^A 
parlp.p)  parip.p) 

stream(i//) 


Inip  Ip 


Thus,  we  assume  that  rewrite  rules  for  each  paradigm  is  aware  of  tags  and  tag  parameters 
from  the  other  paradigm  that  will  later  be  applied. 

ParStreamCore.  The  first  method  results  in  each  PE  operating  independently  and  asyn¬ 
chronously  with  respect  to  other  PEs  during  each  streaming  stage:  each  PE  is  assigned  a  set 
of  data  points  of  the  vector  resident  in  main  memory,  and  directly  reads  these  data  points  from 
main  memory,  computes  on  them,  and  writes  them  back  to  main  memory  without  interacting 
in  any  manner  with  the  other  PEs.  This  is  visualized  in  Figure  4.6.  In  the  resulting  code,  the 
outer  loop  is  the  parallel  loop,  while  the  inner  loop  is  the  streaming  loop.  Accordingly,  we  con¬ 
struct  algorithms  that  use  this  method  by  first  removing  the  parallel  tag,  and  then  removing  the 
streaming  tag,  leading  to  the  desired  loop  nesting: 


parstreamcoreO 


In  ®  Ip  ®||  Ib  ilnlbp  ® 

stream(i/A) 


nm 


(4.26) 
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Figure  4.6:  ParStreamCore  algorithm  visualization.  Each  PE  works  on  independent  chunks  from  main 
memory,  thus  requiring  low  synchronization  overhead. 


parstreamcoreO 


stream(i//) 


par(p,p) 


,  mpb 


^nlpb)  {jp  ®  II  Ib  (^m  ^  Inipb')'}  ®  ^nlpb) 


n 


(4.27) 


Note  that  there  are  no  constraints  on  /i,  because  this  method  does  not  involve  data  exchanges 
directly  between  the  PEs. 

ParStreamCore:  Constraints.  Assuming  C  is  the  maximum  kernel  size  that  will  fit  on  a  single 
core,  our  constraints  are; 


mn/ pb  <  C  (kernel  size  constraint) 

mnipb  >  y/  (memory  packet  size  restriction,  non-strided) 

n/pb>y/  (memory  packet  size  restriction,  strided) 

StreamParChip.  In  the  second  type,  the  PEs  collectively  bring  in  data  that  fits  on  the  com¬ 
bined  local  memories.  The  kernel  is  then  computed  in  parallel  across  the  PEs.  Data  must  be 
redistributed  between  the  PEs  after  the  load  from  memory,  before  the  store  to  memory,  or  both. 
This  is  visualized  in  Figure  4.7.  In  the  resulting  code,  the  outer  loop  is  the  streaming  loop  while 
the  inner  loop  is  the  parallel  loop.  Accordingly,  we  construct  algorithms  that  use  this  method  by 
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Figure  4.7:  StreamParChip  algorithm  visualization.  The  main  idea  is  to  add  an  “on-chip  stage,”  resulting 
in  the  benefit  of  an  additional  stage  without  the  full  additional  cost  of  one.  Larger  packet  sizes  to  main 
memory  (and  smaller  on-chip  packets)  can  he  used  with  this  algorithm.  However,  the  PEs  must  he  syn¬ 
chronized  several  times  within  each  stage. 


first  removing  the  streaming  tag,  and  then  removing  the  parallel  tag,  leading  to  the  desired  loop 
nesting: 


streamparchipO 


par{p,^) 


stream  (y') 


Ijj  iSi=  Ip  iSi||  ilfilbp  ® 


(4.28) 


StreamparchipO 


{LZ^  ®  Inlb)  [h  ®  ®  Inlbp)Ip  ®  II  iAm  ®  hlbp)  (LZ/T  ®  ^nlbp))]  (LZ^  ^  hlb) 


par(p,p) 

stream{i//) 


[LZ^  ®  Inlb)  [h  )  {LZ^  ®  Inlb) 

par(p,p) 


(4.29) 


StreamParChip:  Constraints.  Assuming  C  is  the  maximum  kernel  size  that  will  fit  on  a  single 
core,  our  constraints  are: 
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mnibp  <  C 
nib  >  nib 
nibp  >  nibp 


(kernel  size  constraint) 
(memory  packet  size  restriction) 
(parallel  packet  size  restriction) 


The  main  advantage  of  the  StreamParChip  method  is  that  because  it  takes  advantage  of  the 
combined  local  memories  of  all  the  PEs,  the  size  of  the  kernels  chunk  used  for  computation  scales 
with  the  number  of  PEs.  Consequently,  the  problem  size  of  the  DFTs  that  can  be  computed  using 
a  given  number  of  stages  (before  an  additional  stage  is  needed)  also  scales  with  the  number  of 
PEs.  The  main  downside  is  the  requirement  of  inter-PE  data  exchanges,  which  leads  to  one  or 
two  synchronization  barriers  per  iteration. 

Another  way  to  look  at  the  StreamParChip  method  is  the  following:  we  introduce  an  addi¬ 
tional  read/write  stage,  except,  this  stage  is  performed  between  the  PEs’  local  memories  (as  op¬ 
posed  to  main  memory),  and  individually  on  each  computed  chunk.  Therefore,  it  provides  us 
with  some  of  the  benefits  of  an  additional  stage  (including  increased  packet  sizes  to  main  mem¬ 
ory),  while  not  incurring  the  full  cost  of  an  additional  stage. 

In  general,  the  two  methods  (ParStreamCore,  StreamParChip)  trade  main  memory  data  trans¬ 
fer  packet  size  for  the  cost  of  inter-PE  data  exchange  and  synchronization.  This  works  particu¬ 
larly  well  on  platforms  like  chip-based  distributed  memory  systems  where  the  bandwidth  be¬ 
tween  PEs  far  exceeds  the  main  memory  bandwidth. 

4.3.2.1  Streaming,  Parallelized  Algorithms  for  Large-Sized  ID  and  2D  DFTs 

We  can  now  build  algorithms  for  streamed,  parallel  large-sized  ID  and  2D  DFTs  using  the  follow¬ 
ing  two  steps: 

1.  We  apply  a  breakdown  rule  that  breaks  down  the  ID  or  2D  DFT  into  two  or  three  stages, 
using  our  algorithms  presented  earlier.  Degree  of  freedom:  the  number  of  stages  to  pick. 

2.  We  apply  our  parallelization  and  streaming  technique  to  the  factors.  Degree  of  freedom: 
the  technique  to  use  (ParStream  or  StreamPar)  for  each  factor. 

Some  choices  within  the  degrees  of  freedom  are  resolved  by  the  constraints  imposed  by  each 
step.  For  example,  a  choice  cannot  made  in  the  first  step  that  will  result  in  a  dead  end  in  the 
second  step  due  to  packet  size  or  kernel  size  constraints  being  violated.  These  choices  can  be 
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eliminated  by  representing  the  algorithm,  packet  size,  and  kernel  size  constraints  as  a  set  of  lin¬ 
ear  equations,  and  solving  them  to  determine  the  minimum  number  of  stages  required  for  a 
particular  problem  size  on  a  given  platfom.  The  remaining  choices  in  the  degrees  of  freedom 
form  a  small  search  space,  though  which  we  search  empirically  by  generating  all  variants. 

4.3.3  Memory  Bandwidth  Bound  on  Streamed  DFTs 

When  computing  large  sized  DFTs  on  systems  with  a  memory  node,  the  maximum  possible  per¬ 
formance  is  bound  not  only  by  the  peak  compute  performance,  but  also  by  the  bandwidth  be¬ 
tween  the  PEs  and  main  memory.  In  this  section,  we  develop  a  model  to  determine  upper  bounds 
on  DFT  performance  for  a  platform  based  on  its  architectural  parameters,  to  help  us  predict  and 
evaluate  the  performance  of  our  generated  DFTs  on  that  platform. 

For  the  FFT,  as  the  problem  size  grows  larger,  the  ratio  of  the  number  of  compute  operations 
to  the  number  of  loads  and  stores  grows  asymptotically  by  a  factor  of  O(logn).  For  larger  prob¬ 
lem  sizes  that  need  to  be  streamed  in  from  main  memory,  this  implies  that  as  the  problem  size 
increases,  so  must  the  performance  of  the  FFT,  as  measured  in  flop/s,  up  to  the  architecture’s 
peak  performance  limit.  However,  an  important  factor  prevents  this.  As  mentioned  earlier,  the 
effective  memory  bandwidth  achieved  typically  decreases  with  a  decrease  in  packet  size.  Packet 
sizes  decrease  with  an  increase  in  DFT  problem  size  for  a  given  number  of  stages.  At  the  point 
where  low  packet  sizes  severely  constrain  performance,  an  extra  stage  can  be  added  (at  a  cost)  to 
the  DFT  to  increase  packet  size. 

Cost  of  adding  a  stage.  Each  stage  involves  reading  an  entire  vector  the  problem  size,  com¬ 
puting,  and  writing  it  back.  If  performance  is  already  bound  by  memory  bandwidth,  we  cannot 
hide  the  costs  of  the  additional  loads  and  stores  caused  by  the  added  stage.  The  point  at  which  it 
becomes  better  to  use  an  additional  stage  can  be  determined  using  our  model  below. 

Handling  twiddle  factors.  Since  precomputed  twiddle  factors  for  a  large  problem  size  can 
themselves  be  too  large  to  be  distributed  and  stored  on  the  local  memories,  we  have  two  alter¬ 
natives  to  handling  twiddle  factors:  (a)  store  them  in  main  memory  and  stream  them  in  along 
with  the  data,  and  (b)  if  we  are  operating  well  below  the  compute  peak,  we  can  compute  them 
on  the  fly  without  affecting  overall  performance.  Although  we  discuss  these  in  more  detail  in 
Chapter  5,  we  make  a  note  of  twiddle  factors  here  because  our  model  must  take  into  account 
streamed  twiddles. 

Memory  bandwidth  bound  on  streamed  DFTs.  We  will  now  see  how  to  obtain  a  performance 
bound  for  DFTs  based  on  memory  bandwidth. 

•  B  =  Memory  bandwidth  for  both  reads  and  writes  [in  Bytes  per  second  (B/s)].  For  now. 


4.3.  Latency  Optimized  DFTs  Streamed  from  Memory 


97 


we  assume  a  simple  memory  bandwidth  model,  where  we  can  achieve  full  memory  band¬ 
width  above  a  certain  packet  size  threshold,  and  are  unable  to  access  memory  below  the 
threshold 

•  C  =  Bytes  per  complex  element  [in  Bytes  per  complex  element] 

•  s  =  Number  of  stages 

•  t  =  Number  of  twiddle  stages 

•  N  =  DFT  size 

•  n  =  DFT  log  size  (n  =  log(n)) 

Total  number  of  stages:  (s  +  0.5 1]  (each  twiddle  stage  involves  only  a  read,  and  no  write,  and 
hence  accounts  for  0.5  times  the  cost  of  a  regular  stage). 

Each  stage  in  S  reads  2"  complex  elements,  and  writes  2”  complex  elements. 

We  assume  our  performance  is  bound  by  the  lower  of  the  memory  bandwidth  and  the  peak 
compute  performance. 

Total  data  transferred  per  stage:  complex  elements,  or  C  •  bytes. 

Time  taken  per  stage:  seconds 

Since  we  assume  we  are  bound  by  memory  bandwidth,  a  DFT  of  size  log(n)  is  completed  in 
the  amount  of  time  above.  So  its  performance  is: 


5  *  n  *  2” 

“  (U+0.5f)*2"+‘*C)  ® 

B 


5nB 

2C(s  +  0.5f) 


flop/s. 


(4.30) 


To  summarize,  computing  memory  bandwidth-bound  FFT  performance  on  a  given  architec¬ 
ture  consists  of  two  parts: 


1.  Determine  the  number  of  stages  involved,  and  identify  stages  that  require  streamed  twid¬ 
dles. 


2.  Use  (4.30)  to  compute  bandwidth-bound. 
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4.3.4  Streaming  on  Cache-based  Architectures 

The  streaming  techniques  we  presented  can  be  used  on  cache-based  architectures.  Although  we 
do  not  focus  on  or  use  cache-bases  architectures  in  this  thesis,  we  briefly  discuss  the  applicability 
of  our  techniques  to  cache-based  architectures.  The  difference  is,  we  do  not  have  fine  control 
over  what  is  on-chip  and  what  is  not  at  any  given  point.  To  get  around  this,  we  could  use  a 
combination  of  several  techniques.  First,  we  could  use  standard  buffering  techniques  to  avoid 
accessing  memory  in  large  strides  (which  can  map  to  the  same  lines  in  cache  and  cause  conflict 
misses).  To  buffer,  we  copy  our  data  elements  which  we  would  have  accessed  in  strides  into  a 
smaller  array,  and  then  perform  operations  on  the  array,  and  finally  copy  the  array  back  into 
the  original  array.  In  addition,  if  the  architecture  supports  it,  we  could  use  streaming  reads  and 
writes  on  the  main  array  to  ensure  that  they  do  not  pollute  the  cache,  but  just  get  copied  into  our 
smaller  array  upon  access. 

4.3.5  Summary 

In  this  section,  we  presented  algorithms  for  latency  optimized  DFTs  where  the  input  and  output 
data  are  resident  in  main  memory.  We  summarize  the  algorithms,  paradigms,  and  tags  presented 
in  this  section: 

•  Algorithms  used:  Cooley- Tukey  TFT  (for  ID  DFTs),  row-column  FFT  (for  2D  DFTs),  vector 
recursion  for  breaking  down  the  Cooley- Tukey  algorithm  into  three  stages. 

•  Algorithms  built:  1-3  stage  ID  and  2D  FFTs,  and  4-stage  2D  FFTs  for  streaming  using  a 
single  PE.  In  addition,  we  presented  two  techniques  (ParStreamCore  and  StreamParChip) 
to  combine  streaming  and  parallelism,  leading  to  several  algorithms  that  use  one  or  both 
of  these  techniques. 

•  Paradigms  used:  Parallel  paradigm,  streaming  paradigm. 

•  Tags  used:  tag  to  stream.  and  to  parallelize  and  stream 

stream(i//)  parstreamcoreO  streamparchipO 

an  SPL  construct. 


4.4  Throughput  Optimized  DFTs 

So  far,  our  FFTs  have  been  optimized  for  latency.  That  is,  by  default,  we  optimize  the  cost  of 
reading,  computing,  and  writing  out  a  single  transform.  We  can  also  optimize  our  libraries  for 
throughput,  which  is  useful  for  applications  that  must  compute  a  large  number  of  DFTs  one 


4.4.  Throughput  Optimized  DFTs 


99 


after  the  other,  with  no  intervening  computation.  In  this  thesis,  we  only  focus  on  throughput 
optimization  when  a)  the  DFT  input  size  is  small  enough  to  fit  among  the  combined  memories 
of  the  PEs,  and  h)  when  we  have  a  memory  node,  and  are  optimizing  for  throughput  for  a  stream 
of  DFTs  of  the  same  problem  size  that  are  stored  in  the  memory  node. 

Optimizing  for  throughput  primarily  involves  minimizing  the  costs  of  data  transfer  from 
main  memory  to  the  PEs  by  overlapping  computation  with  communication.  Significant  gains 
in  performance  cannot  be  achieved  by  optimizing  large  (larger  than  cache  or  local  storage)  sizes 
for  throughput,  since  large  sized  DFTs  are  computed  in  parts  from  main  memory,  where  we  al¬ 
ready  overlap  computation  with  communication  as  described  in  Section  4.3. 

The  Cell  BE  is  the  only  target  platform  in  this  thesis  that  has  a  memory  node,  and  thus,  we 
discuss  throughput  optimization  primarily  in  the  context  of  the  Cell  BE. 

Performing  n  multiple  independent  DFTs  (which  allows  for  throughput  optimization)  on 
contiguously  stored  input  vectors  is  expressed  in  SPL  by: 

/„i8>DFTm. 

In  the  remaining  part  of  this  section,  the  distinction  between  ID  and  2D  DFTs  does  not  mat¬ 
ter  since  optimizing  either  for  throughput  involves  the  same  techniques. 

4.4.1  Streaming  Small  DFTs 

In  this  case,  we  assume  that  we  compute  a  large  number  of  small  DFTs  on  contiguous  chunks 
of  the  input  data,  with  data  resident  in  main  memory.  Small  DFTs  are  DFTs  of  problem  sizes 
small  enough  to  fit  into  the  local  memory  of  a  single  PE,  or  in  the  Cell,  inside  a  single  local  store. 
Optimizing  the  streaming  of  small  DFTs  is  easy,  and  we  use  two  tools  to  do  so:  (a)  we  stream 
DFTs  from  an  SPE  to  main  memory,  and  (b)  we  use  multiple  SPEs  in  parallel,  each  working  on  a 
separate  DFT  problem. 

In  the  simplest  case,  we  express  that  we  want  to  stream  s  DFTs  from  a  single  SPE  to  main 
memory  in  SPL  as: 


Is  DFTm 

stream  (i//) 

We  apply  our  rules  from  Table  3.2  to  stream  (4.4.1).  Performance  is  bound  by  the  main  mem¬ 
ory  bandwidth  of  a  single  SPE  in  this  case. 

To  improve  performance,  we  can  can  trivially  parallelize  the  above  and  have  SPE  working  on 
a  separate  DFT.  We  express  this  in  the  following  SPL  expression,  which  computes  p  x  s  DFTs: 
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<8i  =  /p  <8i  II  DFT^ 
par{p,^l) 

'' - S/- - ^ 

stream  (i//) 

This  leads  to  a  loop  nest  with  an  outer  streaming  loop  and  an  inner  parallel  loop.  Below,  we 
show  how  (4.4.1)  translates  to  Z-SPL.  We  note  that  since  the  DFTs  work  on  completely  indepen¬ 
dent  problems,  we  push  the  outer  streaming  loop’s  gather  and  scatter  into  the  inner  parallel  loop, 
using  identities  in  [Franchetti  et  al.,  2005]: 


s-l  /p-1 

^  Sdt(^7)  S  S(h)DFT,„G{h)  GdtC'?) - 

j=0  i=0 

s-1  p-1 

JiSmiq)S{h)DFTmG[h)GBTiq)^ 

j=0  i=0 

s-1  p-1 

X  ^dtC^IDFIot  Gdt('7)  (4.31) 

j=0  1=0 


Note  that  in  the  above,  we  eventually  fused  the  outer  sum’s  scatter  and  gather  with  the  inner 
sum’s  scatter  and  gather,  as  described  in  [Franchetti  et  al.,  2005]. 

Performance.  Performance  in  our  second  case  is  bound  by  the  total  off-chip  memory  band¬ 
width.  Notice  that  because  of  our  loop  ordering  above  (the  parallel  loop  is  the  inner  loop),  our 
data  access  is  performed  in  large  chunks.  Since  main  memory  systems  are  typically  optimized  for 
this  type  of  access,  we  can  expect  to  achieve  high  memory  performance,  bound  by  main  memory 
bandwidth  specifications. 

Note  that  here  is  a  limit  on  the  number  of  SPEs  we  can  parallelize  across  before  we  saturate 
the  main  memory  bandwidth.  Also,  we  note  that  the  main  memory  bandwidth  bound  increases 
as  the  problem  size  increases.  This  is  because  the  computation  (5nlog(n))  to  load/store  ratio 
(2n)  for  the  DFT  increases  as  the  size  increases.  This  is  reflected  in  the  bandwidth  bound  line  on 
plot  Figure  6.7(a)  in  Chapter  6. 


4.4.2  Streaming  Medium- Sized  DFTs 

Medium-sized  DFTs  are  those  that  do  not  fit  into  the  local  store  of  a  single  SPE,  but  fit  across  the 
combined  local  stores  of  multiple  SPEs.  In  this  case,  we  parallelize  the  DFT  across  the  SPEs,  and 
stream  such  parallelized  DFTs  from  main  memory.  In  SPL,  this  is: 
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Is  ®= 

par(p,^) 
stream  (i//) 

Since  we  parallelize  the  DPT  in  this  case,  we  end  up  with  two  parallel  loops  obtained  hy 
breaking  down  the  DPT  with  our  parallelization  rules.  At  this  stage,  we  can  optionally  apply 
the  optimization  we  applied  to  small  DPTs:  we  can  push  in  the  DMA  scatter  and  gather  into  the 
parallel  loops,  and  combine  them.  However,  by  doing  so,  while  performance  may  improve  in 
some  cases,  it  might  become  worse  in  other  cases.  The  reason  for  this  is  because  when  in  the 
outermost  sum,  the  DMA  gather  and  scatter  read  and  write  in  large  packet  size,  while  pushing 
them  and  fusing  them  with  the  parallel  sum’s  gather  and  scatter  cause  them  to  now  read  and 
write  memory  packets  in  small  sizes: 


5-1 

^  Sdt(<7) 
j=o 


'p-\ 


p-i 


X  S(h)DPT„G{h)X  S(h)DPT„G{h) 
,;=0  1=0 

s-1  /p-1 


p-i 


%  X  SDT(^7)DPT„G(h)X  S(h)DPT^GDT(^7) 

j=0  \ 1=0  i=0 


(4.32) 


We  perform  this  optional  final  step  only  when  performance  is  increased  as  a  result.  This 
can  either  be  determined  experimentally,  or  be  determined  by  a  minimum  packet  size  threshold 
which  must  be  respected  by  the  fusing  rule. 


4.5  Rewriting  for  Usage  Scenarios 

We  have  now  discussed  our  parallel  and  streaming  paradigms,  and  how  to  generate  DPTs  based 
on  these  paradigms.  As  illustrated  by  Pigure  4.1,  there  are  various  usage  scenarios  we  might  need 
to  generate  code  for.  We  have  discussed  most  of  these  scenarios  in  previous  sections.  In  this 
section,  we  summarize  these  scenarios  and  show  how  we  use  our  tag  guided  rewriting  process 
to  generate  algorithms  for  each  scenario. 

We  can  specify  each  of  the  usage  scenarios  depicted  in  Pigure  4. 1  by  the  appropriate  use 
of  tags,  which  then  guide  the  rewriting  system.  Table  4.1  shows  these  tag  based  specifications 
corresponding  to  the  scenarios  in  Pigure  4.1.  Table  4.1(a)  shows  latency- optimized  scenarios 
(can  be  used  to  compute  a  single  DPT),  which  were  all  discussed  in  Section  4.2  and  Section  4.3. 
Table  4.1(b)  shows  throughput-optimized  scenarios  (available  only  when  performing  5  DPTs), 
which  were  all  discussed  in  Section  4.4. 
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Tagged 

Specification  Description 


(a)  Latency-optimized  usage  scenarios 
DFTm 


vec(v) 

DFTfn 

stream(<//) 

DFTm 

par(p,/j) 

DFTm 

par(p,p) 

stream(<//) 


Single  DPT  on  a  single  PE,  from  local  memory 

Large  DPT  streamed  in  parts  to  a  single  PE,  from  main  memory 

DPT  parallelized  across  p  PEs,  from  local  memory 

Large  DPT  parallelized  across  p  PEs,  from  main  memory,  streamed  in  parts 


(b)  Throughput-optimized  usage  scenarios 

Is  ®  DPTm  Streaming  across  s  DFTs,  using  a  single  PE  (small  DFTs) 

stream(ip) 

/p  ®  0  DFTm  Streaming  across  s  DFTs,  parallelization  across  p  PEs  (many  small  DFTs) 

stream  (ip) 
par(p,p) 

Is  ®  DFTm  Streaming  across  s  DFTs,  each  DFT  parallelized  to  run  on  p  PEs  (mid-sized  DFTs) 

par(p,p) 
stream(ip) 

Table  4.1:  Problem  specification  using  tags  for  DFT  usage  scenarios.  All  DFTs  assume  a  vectorization  tag 
with  innermost  nesting. 


4.6  Chapter  Summary 

In  this  chapter,  we  presented  algorithms  for  generating  code  for  latency  optimized  and  through¬ 
put  optimized  DFTs.  For  latency  optimized  DFTs,  we  first  presented  our  basic  and  large  packet 
parallelization  techniques.  Next,  we  presented  algorithms  for  latency  optimized  DFTs  streamed 
from  main  memory,  followed  by  algorithms  and  techniques  for  streamed,  parallel  DFTs.  In  all 
cases,  we  presented  methods  to  generate  code  that  explicitly  managed  data  transfers  (both  from 
main  memory  to  PEs  and  between  PEs),  and  minimized  their  costs  by  overlapping  them  with 
computation.  We  then  presented  techniques  to  optimize  small  and  medium  sized  DFTs  for 
throughput.  Finally,  we  presented  problem  specifications  that  can  be  used  to  generate  code  for 
a  variety  of  usage  scenarios. 

tn  the  next  chapter,  we  will  describe  the  rest  of  our  system  that  generates  code  from  the 
algorithms  we  presented  here,  and  in  addition,  builds  a  tagging  system  to  allow  the  user  to  create 
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high-level  problem  specifications. 


Chapter  5 


Program  Generation 


In  the  previous  chapter,  we  explained  how  to  adapt  an  FFT  to  the  architectural  paradigms  of  par¬ 
allelization  and  streaming  based  on  coarse  hardware  information  in  the  form  of  a  few  parame¬ 
ters.  This  adaptation  is  implemented  as  a  rewriting  system  and  forms  the  core  of  our  generator. 
In  this  chapter,  we  discuss  the  infrastructure  of  our  generator  huilt  around  this  core.  This  in¬ 
cludes  both  the  generation  of  detailed  input  specifications  from  higher  level  user  specifications, 
and  the  generation  of  an  output  program  from  an  adapted  algorithm  represented  in  Z-SPL.  Our 
infrastructure  uses  and  extends  that  of  the  existing  Spiral.  We  explain  both  our  infrastructure 
and  the  existing  infrastructure  from  previous  work  to  illustrate  the  inner  workings  of  our  system. 
We  end  the  chapter  with  a  brief  example. 

Figure  5.1  shows  the  overall  program  generation  system.  The  input  to  our  system  is  shown 
on  the  top,  and  the  output  is  a  C  program  or  library,  shown  on  the  bottom.  The  labels  on  the  left 
side  show  the  knowledge  our  system  includes  to  process  a  given  stage. 

System  working.  We  give  an  overview  of  the  system  in  Figure  5.1.  A  detailed  explanation 
follows  later. 

•  Input:  The  input  to  our  system  consists  of  DFT  parameters  (e.g.,  input  size),  problem  spec¬ 
ification  (e.g.,  throughput  or  latency  optimization,  number  of  processors  to  use),  and  ar¬ 
chitectural  specification  (e.g.,  programming  and  memory  paradigm,  data  transfer  packet 
size  to  use). 

•  System  specification  generation:  The  first  stage  converts  the  input  into  a  formal  transform 
specification  with  appropriate  tags.  This  is  the  form  that  can  be  processed  by  the  next 
stage.  We  cover  this  in  Section  5.1. 
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Transform  Problem  Architecture 
with  user  tags  specification  specification 


C  code  or  library 


Figure  5.1:  Program  generation  system  overview. 


•  Algorithm  generation  and  rewriting  for  paradigm:  This  stage  is  responsible  for  generating 
algorithms  for  the  specified  problem  which  are  adapted  to  the  specified  paradigm  and 
parameters.  This  stage  is  implemented  as  a  rewriting  system  and  outputs  the  resulting 
algorithm  in  Z-SPL.  The  core  of  these  operations  were  covered  in  Chapter  4.  We  present 
implementation  details  in  Section  5.2. 

•  Code  generation:  Finally,  using  the  architecture  specifications,  the  Z-SPL  algorithm  is  trans¬ 
lated  into  C  code,  possibly  containing  MPl  calls,  DMA  directives,  and  vector  intrinsics.  We 
cover  this  in  Section  5.3,  and  address  some  lower  level  issues  in  Section  5.4. 

•  Output:  The  final  Is  either  an  optimized  C  function  (for  a  fixed  input  size  DFT)  or  a  C  library 
(for  the  general  input  size  DFT) . 


5. 1  System  Specification  Generation 


In  this  section,  we  describe  the  first  stage  or  block  in  Figure  5.1.  As  shown,  the  input  consists  of 
three  parts:  the  tagged  transform,  a  problem  specification,  and  an  architecture  specification.  We 
discuss  each  of  these  parts  in  detail  below,  and  then  show  how  to  use  them  to  generate  system 
specifications  that  can  be  used  by  the  second  stage  of  our  system. 
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Tag 

Description 

Parameters 

No-tag  default  assumption 

A 

Parallelize 

p  (optional) 

Run  on  single  PE 

parallelizeuCp) 

memoryuO 

Vectors  in  main  memory 

(none) 

Vectors  in  local  memory 

Table  5.1:  User  tags  as 

indicated  by  the  subscript  u. 

5.1.1  User  Tags  and  System  Tags 

The  scenarios  shown  in  Section  4.5  are  all  relevant  but  also  pose  a  problem  for  the  use  of  our 
system.  First,  the  user  is  required  to  have  knowledge  of  the  system:  specifically,  what  problem 
sizes  would  not  fit  in  the  local  memories,  and  what  number  of  PEs  would  yield  the  fastest  perfor¬ 
mance  for  a  particular  problem  size.  Second,  the  scenarios  make  implicit  assumptions  about  the 
location  of  the  input  and  output  data.  For  example,  they  assume  that  transforms  small  enough 
to  fit  in  local  memory  are  stored  in  local  memory,  which  may  not  be  the  case.  Third,  as  we  saw 
in  Chapter  4,  tags  may  interact  during  rewriting.  In  particular,  this  happens  with  the  streaming 
and  parallelization  tags  as  well  as  with  the  large  packet  parallelization  and  the  vectorization  tag. 

To  resolve  these  issues,  we  group  tags  into  two  tag  spaces:  user  tags  and  system  tags.  As  their 
name  implies,  the  former  are  specified  by  the  user,  while  the  latter  are  automatically  introduced 
by  the  system  as  a  part  of  the  rewriting  process.  Note  that  users  can  also  specify  system  tags  if 
needed:  this  gives  the  user  the  flexibility  to  specify  as  little  or  as  much  as  possible,  and  allows  the 
system  to  automatically  fill  in  (deterministically  or  through  experimental  evaluation)  the  rest  of 
the  required  tags  and  parameters.  Below,  we  describe  these  two  tag  spaces  and  how  our  tagging 
system  works. 

User  tags.  User  tags  are  those  specified  by  the  user.  The  set  of  available  user  tags  are  shown 
in  Table  5.1,  along  with  the  required  parameters  and  defaults  when  the  tag  is  not  included  in 
the  specifications.  User  tags  have  the  subscript  “u.”  Available  tags  include  the  parallelization 
tag  which  specifies  that  the  user  wants  parallelization  (and  optionally,  for  a  specified  number  of 
processors),  and  the  memory  tag,  which  specifies  that  the  input  and  output  vectors  are  in  main 
memory. 

User  tags  can  be  thought  of  as  “higher- level”  tags,  as  they  allow  the  user  to  minimally  specify 
the  requirement  at  a  high  level.  They  do  not  rewrite  the  transform  itself,  but  rather  get  replaced 
first  with  system-level  tags.  User  tags  can  be  specified  in  any  order. 

System  tags.  System  tags  are  automatically  generated  by  our  system  using  information  from 
user  tags  and  the  architecture  specification.  The  set  of  available  system  tags  is  shown  in  Table  5.2. 
All  these  were  introduced  in  Chapter  3.  Note  that  the  user  can  also  create  a  more  detailed,  lower 
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Tag 

Description 

Parameters 

A 

Use  small  packet  parallelization 

P’fJ- 

par(p,p) 

A 

Use  large  packet  parallelization 

P 

parBigCp) 

A 

Use  streaming  (for  throughput  and  large-sizes) 

V 

streain(i/f) 

StreamAndParO 

Use  both  streaming  and  parallelization  (using  ParStreamCore  or 
StreamParChip) 

none 

Table  5.2:  System  tags. 

level  requirement  specification  by  using  any  or  all  of  the  system  tags. 

Rewriting  expressions  to  remove  the  individual  parallelization  and  streaming  tags  were  dis¬ 
cussed  in  Chapter  3.  The  last  tag  in  Table  5.2  is  rewritten  with  the  following  two  rules: 


StreamAndParO 


StreamAndParO  StreamAndParO 


StreamAndParO 


search 


parstreamcoreO  streamparchipO 


A  implies  that  our  system  searches  for  the  best  among  tbe  choices  on  tbe  right  hand 
side.  The  second  rule  above  bas  two  choices.  Note  that  large  packet  parallelization  (  A  )  does 

parBig(p) 

not  work  in  conjunction  with  streaming. 

Systems  tag  nesting.  For  the  rewrite  system  to  work  correctly  and  find  the  set  of  applicable 
rules  at  any  point  during  the  rewriting,  our  system  tags  must  be  nested  correctly.  Tags  can  be 
classified  based  on  the  three  paradigms  we  us: 


1. 

2. 

3. 

4. 


Vectorization: 

vec(v) 

Parallelization: 

par(p,p)  parBig(p)  parBig(p,v) 

Streaming: 

stream(i//)  mem(s) 

Both  parallelization  and  streaming: 


parstreamcoreO  streamparchipO  StreamAndParO 


System  tags  are  nested  thus: 
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1.  The  relative  nesting  of  any  of  the  streaming  and  parallelization  tags  is  determined  during 
rewriting  based  on  whether  the  StreamAndParO  tag  is  rewritten  into  a  StreamParChipO  tag 
or  a  ParStreamCoreO  tag,  as  explained  in  Section  4.3.2. 

2.  The  vectorization  tag  is  always  introduced  at  the  innermost  nesting  level,  as  it  must  he  the 
last  one  to  he  removed  hy  the  rewriting  system. 

5.1.2  Architecture  specification 

Our  system  requires  the  following  knowledge  about  the  target  architecture: 

•  maximum  number  of  of  PEs  available 

•  local  memory  size  determines  the  largest  transform  C  that  will  fit  into  local  memory 

•  presence  of  a  memory  node  (yes/no) 

•  minimum  packet  size  to  use  for  inter- PE  communication  (p) 

•  minimum  packet  size  to  use  for  memory- PE  communication  (ip) 

In  addition,  code  for  the  following  are  required  for  generating  output  code: 

•  non-blocking  send  or  receive 

•  memory  fence 

•  synchronization  barrier 

The  SIMD  instruction  set  is  also  required  as  explained  in  [Franchetti  et  al.,  2006b]. 

5.1.3  Problem  specification 

In  addition  to  the  transform  specification  and  the  architecture  specifications,  the  following  are 
required  as  input  to  our  system  as  a  part  of  the  problem  specification: 

•  Type  of  code  desired:  fixed  size  or  general  size 

•  Data  distribution  (block  distributed,  block  cyclic) 
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5.1.4  Generating  System  Specifications  from  User  Specifications 

We  use  a  combination  of  the  user-tagged  transform,  the  architecture  specification,  and  the  prob¬ 
lem  specification  to  generate  a  system  specification.  A  system  specification  consists  of  the  trans¬ 
form  with  system  tags.  The  following  steps  convert  the  user  input  to  a  system-tagged  transform: 


1.  Streaming:  if  the  architecture  specification  shows  that  the  input  transform  is  too  large  to 

be  held  in  either  one  or  multiple  PEs  (depending  on  the  presence  of  the  tag), 

parallelizBu  (p) 

then  a  streaming  or  a  streaming  and  parallel  tag  is  added: 

if  no  parallelization  was  specified 
,  if  parallelization  was  specified 

t  StreamAndParO 

In  the  above,  y/  is  obtained  from  the  architecture  specification. 

2.  Parallelization:  If  a  tag  was  not  added  in  the  previous  step,  and  the  user  has 

StreamAndParO 

specified  a  tag,  then  the  following  parallel  tags  are  searched  over: 

parallelizeu(p) 

,  search  ,  ,  .  , 

parallelizeu(p)  par(p,p)  parBig(p) 

In  the  above,  p  is  obtained  from  the  architecture  specification.  Note  that  if  p  is  left  un¬ 
specified,  then  all  valid  values  of  p  that  will  hold  the  transform  (as  determined  from  the 
architectural  specifications)  are  searched  over  for  the  optimal  one. 

3.  Memory  tag:  the  memory  tag  is  handled  thus: 


memoryu  0 


A,  if  any  of  the  stream  tags  exist 

(or  ),Sdt{^7)^Gdt(^7)} 

stream(i^)  StreamAndParO 


In  other  words,  vectors  are  assumed  to  reside  in  main  memory  if  any  of  the  streaming  tags 
exist.  If  not,  vectors  can  either  be  streamed  in  from  main  memory,  or  can  simply  be  read 
and  written  to  main  memory  (without  streaming)  using  explicit  memory  instructions,  and 
our  system  searches  over  these  options  for  the  best  solution. 
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Problem  specifications.  To  handle  various  data  distributions,  our  system  simply  adds  ap¬ 
propriate  permutations  around  the  user  specified  transform  based  on  the  problem  specification. 
Currently,  block  and  block  cyclic  data  distributions  are  supported. 


5.2  Algorithm  Generation  and  Rewriting  for  Paradigm 

In  the  previous  section,  we  discussed  how  to  generate  a  system-tagged  SPL  specification  from 
the  user-tagged  transform,  problem  specification,  and  architecture  specifications.  This  section 
explains  the  second  stage  in  Figure  5.1  which  is  the  core  of  our  system:  it  produces  a  platform- 
adapted  algorithm  specified  in  Z-SPL.  Although  the  main  ideas  in  this  process  were  covered  in 
Chapter  3  and  Chapter  4,  this  section  serves  as  a  wrapper  to  those  chapters,  and  provides  the 
necessary  implementation  details.  Since  the  program  generation  is  significantly  different  for 
fixed  size  and  general  size  code,  we  discuss  them  separately  below. 

5.2.1  Fixed  Size  Program  Generation 

We  describe  the  steps  involved  in  generating  fixed  size  FFT  code.  This  step  is  the  second  stage 
shown  in  Figure  5.1.  We  note  the  degrees  of  freedom  involved  in  each  step,  if  any.  We  first  de¬ 
scribe  our  work,  and  then  briefly  discuss  how  it  composes  with  previous  work  that  targets  SIMD 
vectorization  and  the  memory  hierarchy. 

Step  1:  Algorithm  generation  from  system-tagged  SPL.  Algorithm  breakdown  rules  are  applied 
to  convert  the  transform  into  tagged  SPL  constructs.  Algorithm  breakdown  rules  include 
the  Cooley- Tukey  FFT,  the  vector  recursion  rule,  the  row-column  algorithm,  and  other  2D 
breakdowns  covered  in  Chapter  4.  In  addition,  the  type  of  parallelization  that  will  be  used 
(small  packet  or  large  packet  parallelization)  is  also  decided  in  this  step,  since  the  two  types 
of  parallelization  cannot  be  mixed  with  each  other.  Applying  algorithm  breakdown  rules 
followed  by  applying  the  appropriate  paradigm  rules  allows  our  system  to  automatically 
arrive  at  a  variant  FFT  algorithm  that  is  well  suited  to  the  target  architecture.  Degrees  of 
freedom  (all  subject  to  rule  constraints):  the  number  of  stages  as  decided  by  the  applica¬ 
tion  of  the  vector  recursion  rule,  or  any  of  the  2D  multi-stage  rules;  choice  of  paralleliza¬ 
tion. 

Step  2:  Z-SPL  generation  from  algorithm.  The  tagged  SPL  constructs  are  converted  into  Z-SPL 
using  the  rewrite  rules  presented  in  Table  3.1  and  Table  3.2.  The  Z-SPL  versions  are  map- 
pable  to  the  target  paradigms.  At  this  stage,  constructs  that  are  tagged  both  stream  and 
par  may  have  a  degree  of  freedom,  and  are  re-tagged  with  either  the  ParStreamCore  rule 
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or  the  StreamParChip  rule.  Degrees  of  freedom  (subject  to  rule  constraints):  selection  of 
ParStreamCore  or  StreamParChip.  At  the  Z-SPL  level,  the  system  performs  formal  loop 
merging  based  on  previous  work  in  Franchetti  et  al.  [2005],  and  on  rules  from  our  work 
such  as  and . 

Search.  Steps  1  and  2  above  contain  degrees  of  freedom  which  exist  because  we  do  not  know 
beforehand  the  options  that  will  lead  to  the  best  performance.  We  handle  such  degrees  of  free¬ 
dom  simply  by  generating  code  for  all  the  options  available  and  searching  for  the  fastest  code. 
Since  the  degrees  of  freedom  are  limited,  the  search  space  is  small  and  tractable. 

Optimizing  for  other  architectural  paradigms.  Our  work  composes  with  previous  work  on 
vectorization  and  blocking  for  the  memory  hierarchy.  All  our  evaluation  platforms  incorporate 
SIMD  vectorization,  handled  using  the  work  in  [Franchetti  et  al.,  2006b].  In  addition,  PEs  on 
MPI  systems  typically  have  a  deep  local  memory  hierarchy,  and  the  Cell  can  be  considered  to 
have  a  single  memory  hierarchy  layer  consisting  of  register  space.  To  optimize  for  the  memory 
hierarchy,  previous  work  performs  a  search  over  a  larger  search  space  at  the  lower  recursion 
levels  of  the  algorithm  breakdown  using  dynamic  programming,  as  discussed  in  [Piischel  et  al., 
2005]. 


5.2.2  General  Size  Library  Generation 

We  describe  the  steps  involved  in  generating  general  size  libraries  based  on  our  description  in 
Section  2.4.4.  Note  that  we  only  use  the  general  size  library  generation  system  to  produce  li¬ 
braries  for  our  large  packet  parallelization  algorithm.  Streaming  and  small  packet  parallelization 
are  currently  not  implemented  for  the  general  size  case. 

In  the  previous  section,  we  showed  how  to  rewrite  the  transform  based  on  a  set  of  rules  to  ob¬ 
tain  a  variant  FFT  that  is  suited  to  the  target  platform.  The  overall  idea  in  generating  general  size 
code  is  analogous:  we  feed  this  set  of  rules  into  Autolib.  However,  unlike  the  program  generation 
process  for  fixed  size  code,  the  result  is  a  set  of  Z-SPL  expressions  (as  opposed  to  an  algorithm  in 
Z-SPL)  which  represent  the  recursion  step  closure  (as  described  in  Chapter  2  and  in  [Voronenko, 
2008])  for  the  input  rules.  When  translated  into  mutually  recursive  functions,  the  recursion  step 
closure  forms  the  general  size  library. 

Instead  of  allowing  Autolib  to  generate  the  entire  variant  FFT,  for  the  purpose  of  simplicity, 
we  manually  apply  (4.12)  to  the  DFT,  and  mark  parts  of  it  which  translate  directly  into  MPI  code 
or  local  permutations,  and  only  feed  in  the  remaining  computation  stages  into  Autolib.  The 
entire  process  is  shown  in  Figure  5.2,  and  described  below. 
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All 

All 

All 

to 

Comp  2 

to 

Comp  1 

to 

all 

all 

all 

Q 

(local) 


void  dft_plan_wrapper(plan  *pLeftj  pla 

{  dft_plan_left(pLeft); 
dft_plan_right(pRight); 

} 

void  dft_compute_wrapper(complex  *\,  ^mplex  *X 

-►  {  memcpy(...)  //  local  permutation 
MPI_Alltoall( . . . ); 
dft_compute_left(pLeft j  Yj  X) 

MPI_Alltoall( . . . ); 
dft_compute_right(pRight j  Yj  X) 
MPI_Alltoall( . . . ); 
memcpy(...)  //  local  permutation  Q 

} 


Figure  5.2:  Generating DFTs  in  Autolib. 


Step  1:  Apply  (4.12)  to  the  DFT.  As  shown  in  (4.12),  this  produces  seven  stages  including  two 
computation  stages,  three  global  communication  stages,  and  two  local  permutation  stages. 

Step  2:  For  each  of  the  two  computation  stages,  we  input  the  computation  (including  the  sur¬ 
rounding  local  permutations)  into  the  Autolib  system.  A  separate  library  is  produced  for 
each  of  these  computation  stages. 

Step  3:  Wrapper  code  implements  the  entire  transform  by  including  the  local  permutations  (im¬ 
plemented  using  memcpy ()s),  global  permutation  (implemented  using  MPI_Alltoall), 
and  calls  to  each  of  the  two  libraries  that  implement  the  computation. 

The  process  works  similarly  for  2D  DFTs  and  other  variations  based  on  data  distributions. 

Note  that  the  resulting  code  may  have  fewer  communication  or  permutation  stages. 

Search.  Search  in  general  size  library  generation  is  conducted  by  the  generated  code,  rather 

than  by  the  program  generation  system  as  is  the  case  for  fixed  size  code.  Autolib  thus  encodes 
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the  set  of  algorithms  required  and  the  degrees  of  freedom  into  the  library  itself,  as  discussed  in 
[Voronenko,  2008] . 

5.3  Code  Generation 

The  third  step  of  our  system  shown  in  Figure  5.1  is  to  convert  Z-SPL  generated  in  the  previous 
step  to  C  code,  in  the  fixed  input  size  case,  the  Z-SPL  generated  in  the  previous  step  is  the  actual 
algorithm,  in  the  general  input  size  case,  the  set  of  Z-SPL  expressions  generated  in  the  previous 
step  correspond  to  the  set  of  mutually  recursive  functions  that  are  required  for  the  implementa¬ 
tion.  In  either  case,  the  generated  Z-SPL  expressions  consist  of  regular,  parallel,  streaming,  and 
vector  loops  which  can  he  readily  translated  into  C  code.  Program  generation  is  performed  based 
on  Table  2.4.  Explicit  memory  access  instructions  such  as  MPI  and  DMA  are  generated  based  on 
translation  rules  discussed  in  Chapter  3. 

5.4  Low  Level  Concerns 

In  this  section,  we  address  several  concerns  associated  with  low  level  implementation  details. 
We  first  present  several  architectural  constraints  on  the  Cell  and  our  solutions,  followed  by  issues 
with  computing  or  loading  twiddle  factors,  and  conclude  with  a  brief  discussion  of  synchroniza¬ 
tion  and  memory  barriers  required  for  parallelization  and  streaming. 

5.4.1  Cell  Platform  Constraints 

The  Cell  BE  has  several  platform  imposed  restrictions  which  our  system  considers.  We  discuss 
them  briefly  here. 

Temporary  space.  Since  the  Cell  has  limited  local  memory  (256kB  local  store  per  SPE),  to 
minimize  the  temporary  space  required  by  our  code,  we  use  a  simple  version  of  the  memory 
arena  technique  described  in  [Hanson,  1989].  Before  generating  C  code,  our  system  determines 
the  peak  memory  storage  required  by  the  DET  function  at  any  given  point  in  the  code,  and  allo¬ 
cates  a  memory  arena  of  this  size.  Temporary  space  is  then  allocated  and  deallocated  from  this 
arena,  thus  minimizing  the  amount  of  total  temporary  space  required. 

DMA  alignment  constraints.  The  Cell  has  alignment  constraints  for  DMA  instructions.  Ef¬ 
fectively,  we  cannot  perform  DMAs  under  16  bytes,  and  can  perform  DMAs  only  in  multiples  of 
16  bytes.  We  can  also  perform  DMA  only  on  naturally  aligned  data.  For  two  power  sizes  and 
vectorizable  non-2-power  sizes  beyond  a  minimum  size,  our  packet  size  restrictions  guarantee 
that  DMA  instructions  are  aligned  correctly.  For  other  odd  sizes  which  we  do  not  address  in  this 
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thesis,  alignment  issues  may  be  solved  with  techniques  presented  in  [Franchetti  and  Ptischel, 
2007], 

Memory  banks  and  iteration  skewing.  The  Cell’s  main  memory  is  divided  into  16  banks  for 
performance  reasons.  Memory  addresses  are  striped  cyclically  across  these  banks  at  a  128  byte 
block  size.  In  other  words,  memory  addresses  corresponding  to  the  bytes  0  to  127  are  assigned  to 
bank  0,  addresses  corresponding  to  bytes  128  to  255  are  assigned  to  bank  1,  and  so  on.  Accessing 
memory  addresses  from  only  a  subset  of  the  banks  results  in  a  reduced  effective  bandwidth, 
since  the  architecture  relies  on  a  data  access  pattern  that  evenly  uses  all  banks  to  achieve  the  full 
main  memory  bandwidth  of  25.6  GB/s.  However,  when  our  parallelized  streaming  algorithms 
access  main  memory  at  large  two-power  strides,  they  may  end  up  accessing  only  a  subset  of  the 
banks. 

To  address  this  issue  and  increase  performance  in  this  situation,  we  use  a  simple  iteration 
skewing  technique.  With  the  natural  strides  found  in  the  FFT,  as  the  SPEs  progress  through  their 
loops,  they  all  access  the  same  bank  at  the  same  time.  Instead,  we  skew  the  iterations  with  re¬ 
spect  to  the  SPEs  in  such  a  way  that  while  the  first  SPE  is  working  on  its  first  iteration  of  the  loop, 
the  second  SPE  is  working  on  its  second  iteration,  and  so  forth.  This  evens  out  the  bank  access 
pattern  and  increases  the  effective  main  memory  bandwidth. 

[Chow  et  al.,  2005]  addresses  the  same  problem  by  offsetting  the  imaginary  parts  of  the  com¬ 
plex  input  and  output  arrays  in  such  a  way  that  the  real  and  imaginary  parts  are  always  mapped 
to  different  memory  banks.  Although  this  allows  the  authors  to  achieve  higher  performance,  the 
technique  imposes  a  data  layout  on  the  application,  in  addition  to  being  restricted  to  a  split- 
complex  input  format.  Our  technique  suffers  from  neither  drawback. 

5.4.2  Twiddle  Factors 

All  DFT  computations  involve  multiplications  with  twiddle  factors.  Since  generating  twiddle  fac¬ 
tors  when  computing  a  DFT  is  highly  expensive  (because  of  the  sine  and  cosine  computations 
involved).  Spiral’s  previous  approach  involved  pre-computing  and  storing  twiddle  factors  at  li¬ 
brary  initialization  time.  This  means  that  each  twiddle  factor  is  simply  loaded  from  memory 
when  required. 

When  using  recursive  DFTs,  several  twiddle  factor  tables,  varying  from  small  to  large  (corre¬ 
sponding  to  the  various  problem  sizes  in  the  recursion  tree)  may  be  required.  In  a  distributed 
memory  environment,  the  smaller  of  these  tables  can  easily  be  precomputed  and  replicated 
across  each  PE.  However,  this  strategy  does  not  work  for  larger  twiddle  tables  that  either  do  not 
fit  in  local  memory,  or  are  large  enough  to  be  different  for  each  PE.  We  discuss  our  approach  to 
this  issue  here. 
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Distributing  twiddle  factors.  We  handle  the  precomputed  twiddles  table  for  the  topmost 
(largest)  problem  size  in  the  recursion  tree  by  distributing  it  across  the  PEs  when  the  fit  in  local 
memory.  Each  PE  computes  only  its  portion  of  the  twiddles  table  at  library  initialization  time. 

Streaming  in  twiddle  factors.  For  architectures  such  as  the  Cell  with  limited  local  memory 
but  larger  main  memory,  we  store  the  precomputed  twiddle  factors  in  main  memory,  and  stream 
them  in  along  with  parts  of  the  input  vector  when  needed.  We  use  a  double  buffering  technique, 
similar  to  our  streaming  paradigm,  to  load  twiddle  factors  in  the  background,  one  loop  iteration 
ahead,  to  minimize  the  impact  of  such  loading  on  run  time.  However,  a  part  of  the  memory 
bandwidth  is  used  up  by  the  twiddle  factors,  which  can  impact  run  time  on  on  platforms  where 
memory  bandwidth  bounds  performance. 

Online  computation  of  twiddle  factors.  When  memory  bandwidth  bounds  performance, 
this  means  that  we  potentially  have  unused  compute  cycles  available  to  us,  while  bandwidth 
comes  at  a  premium,  in  this  case,  one  solution  is  to  pre-compute  small  seed  tables  that  fit  in 
local  memory  space,  but  need  additional  computation  (which  must  be  relatively  inexpensive)  for 
conversion  into  actual  twiddle  factors.  Although  we  do  not  currently  implement  this  technique 
in  our  system,  this  may  potentially  increase  performance  by  a  factor  that  can  be  estimated  using 
our  formula  in  Section  4.3.3. 

5.4.3  Synchronization  Barriers 

Our  output  code  requires  the  use  of  two  types  of  barriers:  synchronization  barriers  and  memory 
barriers  (also  known  as  memory  fences). 

Synchronization  barriers.  Synchronization  barriers  are  a  point  in  code  at  which  all  PEs  must 
arrive  before  any  can  proceed  past  them.  Barriers  are  required  whenever  PEs  end  up  using  data 
generated  by  other  PEs.  We  identify  and  mark  the  location  of  barriers  at  a  high  level  using  rewrite 
rules  shown  in  Table  3.1. 

Ideally,  our  barriers  must  be  fast,  so  as  to  not  impact  performance  by  adding  to  the  run  time. 
We  need  as  many  barriers  as  we  have  compute  stages,  which  means  we  need  three  barriers  to 
compute  a  single  DFT  for  any  of  the  two  parallel  DPT  algorithms  in  this  thesis.  Fast  barrier  im¬ 
plementations  may  already  exist  on  a  platform.  MPI  defines  a  barrier  call,  MPI_barrier  ( )  which 
is  typically  already  optimized  for  a  given  platform.  On  the  Cell,  we  use  mailbox  communications 
to  synchronize  across  the  SPEs.  Our  synchronization  barriers  cost  between  600  and  1200  cycles, 
depending  on  the  number  of  SPEs  being  used. 

Memory  barriers.  Memory  barriers  are  points  in  code  where  a  specified  set  of  outstand¬ 
ing  data  transfer  requests  that  were  made  at  an  earlier  point  using  non-blocking  calls,  must  be 
completed  before  execution  can  proceed.  Architectures  that  support  non-blocking  data  trans- 
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fer  requests  also  have  programming  interfaces  to  issue  memory  barriers.  This  makes  generating 
these  simple,  once  the  points  at  which  they  must  be  inserted  have  been  identified.  These  points 
marked  up  during  algorithm  generation,  based  on  the  streaming  rewrite  rules,  as  shown  in  Ta¬ 
ble  3.2. 


5.5  Example 

In  this  section,  we  provide  a  small  example  to  show  how  the  program  generation  system  works. 
We  illustrate  our  example  in  Figure  5.3,  and  discuss  its  progress  through  the  stages  of  our  pro¬ 
gram  generation  system  (shown  in  Figure  5.1). 

Input.  We  specify  the  input  to  be  a  DFT  transform  of  size  2^®,  and  request  it  to  be  parallelized 
across  all  available  PEs.  Our  problem  specification  consists  of  requesting  single  precision  fixed 
size  code  using  a  regular  data  distribution.  Our  architecture  specification  is  that  of  a  Cell  BE 
with  four  available  SPEs.  Given  the  size  of  the  local  stores  of  the  Cell  BE,  we  compute  C  =  2^^. 
Based  on  empirical  data  about  the  interconnect  performance.  We  also  determine  values  for  p 
(minimum  packet  size  for  inter-SPE  data  transfers)  and  y/  (minimum  packet  size  for  chip  to  main 
memory  data  transfers)  and  include  this  in  the  architecture  specification.  We  also  specify  the 
DMA  instruction  set. 

System  specification  generation.  As  shown  in  Figure  5.3,  based  on  our  architecture  specifi¬ 
cation,  our  system  determines  that  we  must  parallelize  across  four  SPEs,  and  that  we  also  need 
to  stream  the  data  from  main  memory  as  it  will  not  fit  on-chip.  Consequently,  a  StreamAndParO 
tag  is  added  with  the  appropriate  parameters.  A  SIMD  vector  tag  is  also  added  by  default.  The 
end  result  is  the  transform  with  all  required  tags  correctly  nested. 

Algorithm  generation  and  rewriting  for  paradigm.  Breakdown  rules  are  applied  to  our 
transform.  The  system  determines  that  we  must  use  a  three-stage  streaming  algorithm.  One  set 
of  possible  parameters  for  the  ID  DFT  three-stage  algorithm  is  shown  in  Figure  5.3.  The  system 
searches  for  the  fastest  code  over  other  parameters  and  alternatives  including  the  ParStreamCore 
and  StreamParChip  versions  of  each  of  the  factors. 

Code  generation.  The  output  of  the  previous  stage  is  an  algorithm  in  Z-SPL  (not  shown). 
This  is  converted  into  C  code  using  the  architecture  specification  to  generate  platform-specific 
calls. 

Output.  The  final  output  is  the  fastest  code  found  over  searching  alternatives,  and  tailored 
to  the  input  specifications. 
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Figure  5.3:  Program  generation  example. 
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Chapter  6 


Experimental  Results 


In  this  chapter,  we  present  the  performance  of  the  DPT  code  generated  hy  our  system.  We  first 
present  our  experimental  setup.  Then,  we  present  the  performance  of  our  generated  single- 
PE  vectorized  kernel,  which  serves  as  a  baseline  reference  for  the  remaining  results.  For  the 
remaining  sections,  results,  we  follow  the  general  outline  of  Chapter  4:  we  first  present  perfor¬ 
mance  results  for  latency  optimized  distributed  DFTs,  followed  by  latency  optimized  distributed 
DFTs  streamed  from  main  memory,  and  conclude  with  the  performance  of  throughput  opti¬ 
mized  DFTs. 


6.1  Experimental  Setup 

We  evaluate  our  generated  single  precision  ID  and  2D  DFT  implementations  on  the  Cell  BE  pro¬ 
cessor,  and  on  three  MPI  clusters:  the  Axon,  the  Cray  XT4,  and  the  Cray  XT5.  The  architectural 
details  of  these  platforms  were  discussed  in  Chapter  3.1.  We  present  other  platform  details  and 
software  details  below. 

Cell  processor.  We  used  a  single  Cell  processor  of  an  IBM  Cell  Blade  QS20  running  at  3.2 
GHz.  Our  code  was  compiled  with  the  gcc  compiler  for  the  SPU,  spu-gcc,  from  the  Cell  SDK 
version  3.0. 

Axon.  On  the  Axon,  we  use  version  4.1.2  of  the  gcc  compiler  to  compile  our  generated  li¬ 
braries  with  the  following  flags:  -msseS  -std=c99  -03  -f strict-aliasing.  The  Axon  uses 
OpenMPI  version  1.4.1,  which  is  an  MPI-2  implementation. 

CrayXT4  andXTS.  On  the  CrayXT4  andXTS  machines,  we  use  version  4.3.3  of  the  gcc  com¬ 
piler  with  the  following  flags: -msse3  -std=c99  -03  -f  strict-aliasing.  Both  machines  use 
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MPICH,  which  is  an  MPI-2  implementation. 

Performance  measure.  All  our  plots  report  FFT  performance  in  pseudo-giga-floating  point 
operations  per  second  (pseudo  gflop/s),  which  is  essentially  inverse  runtime,  defined  as: 

5Ailog2 (AO /(runtime  [s]  •  10^), 

where  N  is  the  size  of  the  ID  or  2D  DFT  kernel  in  complex  samples  (e.g.,  for  a  DFl^xt.  N  =  k^]. 
This  metric  is  based  on  the  asymptotic  operation  count  of  the  radix-2  Cooley- Tukey  algorithm. 
As  discussed  in  Section  4.2.5,  this  is  an  approximate  and  slight  overestimate  of  the  exact  oper¬ 
ations  count  which  depends  on  the  exact  FFT  used,  as  presented  in  [Frigo  and  Johnson,  2005] . 
However,  this  metric  provides  us  with  a  convenient  way  to  measure  run  time  and  compare  it 
across  platforms.  Higher  is  better  in  all  plots.  We  show  only  the  forward  transform’s  perfor¬ 
mance.  The  performance  of  the  inverse  transform  is  typically  the  same. 

Timing  methodology.  For  experiments  that  measured  the  performance  of  latency-optimized 
kernels,  we  measured  the  execution  time  of  a  single  kernel.  We  timed  an  adequate  number  of 
iterations  to  ensure  timing  stability  and  precision.  We  measured  the  performance  of  throughput- 
optimized  kernels  by  running  several  iterations  and  measuring  the  runtime  of  the  steady  state  of 
the  computation. 

Timing  mechanism.  For  obtaining  timing  information  on  the  Cell  BE,  we  use  the  decre- 
menters  available  on  the  SPE.  On  all  MPI  platforms,  we  use  the  timing  mechanism  provided  by 
MPI  by  calling  MPI_Wtinie  () . 

FFTW  and  flags  used.  We  used  FFTW  version  3.3alphal,  which  is  a  preview  version  of  the 
new  MPI  interface.  Previously,  basic  MPI  support  was  available  in  FFTW  2.1.5,  which  had  sev¬ 
eral  limitations,  and  also  provided  poor  performance  in  our  tests  as  compared  to  the  3.3alphal 
version. 

When  we  benchmark  FFTW,  we  use  its  default  planner  flag,  which  is  the  FFTW_MEASURE 
flag  (see  [FFTW,  2008b]  for  details).  With  this  option,  FFTW  measures  the  run  time  of  FFTs  on 
the  target  platform,  and  uses  these  measurements  to  guide  its  search.  FFTW_PATIENT,  which 
promises  a  larger  search  space  was  not  used  because  of  the  time  limitations  on  the  supercom¬ 
puting  platforms.  We  note  that  we  ran  FFTW  with  FFTW_PATIENT  on  a  random  sample  of  prob¬ 
lems  on  each  of  our  target  platforms,  and  found  that  the  performance  results  obtained  were 
never  significantly  better  (within  4%)  than  those  obtained  when  using  FFTW_MEASURE. 
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DFT  on  a  single  SPE  (2-powers,  LS-LS) 
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Figure  6.1:  Baseline:  single-SPE  performance  on  the  CeUBE. 


6.2  Baseline  Single-PE  Performance 


We  first  show  the  performance  of  our  libraries  running  on  a  single  PE  on  each  of  the  platforms. 
This  serves  as  a  baseline  for  comparison  for  the  parallelized  and  streamed  DFTs. 


6.2.1  Cell 

SIMD  Vectorized  DFT  on  a  single  SPE.  Figure  6.1(a)  shows  the  single-core  performance  of  our 
generated  DFT  kernels  for  2-power  input  sizes  for  single  precision  (labelled  “float”)  and  double 
precision  (labelled  “double”)  data  resident  in  the  local  store.  These  vectorized  DFT  kernels  are 
important  building  blocks  in  all  our  parallel  and  streaming  implementations.  We  show  perfor¬ 
mance  for  both  the  interleaved  complex  and  the  split  complex  data  format  to  make  a  fair  com¬ 
parison  to  other  libraries.  Our  generated  code  achieves  16-21  Gflop/s  for  single  precision,  and  8 
Gflop/s  for  double  precision. 

In  Figure  6.1(b),  we  compare  the  performance  of  our  single-precision  split- complex  kernels 
to  hand-tuned  code  from  Mercury  [Gico  et  al.,  2006],  Ghow  [Ghow,  2008],  and  the  IBM  SDK 
FFT  library  [IBM,  2008]  as  measured  in  [Ghow,  2008] .  The  maximum  performance  achieved  by 
our  single  precision  generated  code  is  currently  slower  (0.73x-0.85x)  than  Mercury,  although  our 
code  achieves  its  performance  peak  for  smaller  sizes.  Our  code  is  faster  by  a  factor  of  1.06x-1.6x 
when  compared  to  Ghow’s  library,  and  1.83x-3x  faster  than  IBM’s  SDK. 
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I D  DFTs  on  the  Axon:  Single*PE 

Performance  [pseudo  Gflop/s] 


I D  DFTs  on  the  Cray  XT4:  Single>PE 

Performance  [pseudo  Gflop/s] 


Problem  size 


Problem  size 


(a)  (b) 

Figure  6.2:  Baseline:  Single  PE  performance  on  MPI  platforms. 


6.2.2  Clusters 

In  Figure  6.2(a),  we  show  the  performance  of  Spiral  generated  code  and  that  of  FFTW  running 
on  a  single  PE  on  the  Axon.  Although  Spiral  generated  code  performs  better  than  FFTW  for  in¬ 
cache  sizes  (less  than  size  2^^),  these  are  largely  irrelevant  for  comparisons  with  large  parallelized 
FFTs,  because  large  FFTs  primarily  use  large,  out-of-cache  local  FFTs. 

In  Figure  6.2(b),  we  show  the  performance  of  Spiral  generated  code  and  FFTW  on  a  single 
PE  of  the  Cray  XT4.  Performance  for  the  Cray  XT5  is  similar. 


6.3  Latency  Optimized  Distributed  DFTs 

We  present  the  performance  of  our  latency  optimized  distributed  on  each  platform.  In  this  sec¬ 
tion,  we  only  consider  problem  sizes  that  fit  within  the  combined  local  memories  of  the  PEs. 

6.3.1  Cell 

Parallel  DFT  across  multiple  SPEs,  data  in  locai  stores.  Figure  6.3(a)  and  (b)  display  the  perfor¬ 
mance  of  our  generated  parallel,  multi-SPE  DFT  kernels,  with  data  resident  in  the  local  stores. 
These  kernels  use  single- SPE  vectorized  kernels  as  building  blocks.  We  generate  and  evaluate 
code  for  two  standard  data  distributions:  Figure  6.3(a)  shows  results  for  the  standard  block  dis¬ 
tribution  (i.e.,  data  is  cut  into  p  contiguous  blocks  for  p  SPEs).  Figure  6.3(b)  shows  results  for 
data  in  the  block  cyclic  format,  i.e.,  data  is  distributed  across  the  SPEs’  local  stores  in  a  round- 
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DFT  on  multiple  SPEs  (LS-LS) 
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DFT  on  multiple  SPEs  (LS-LS,  block-cyclic) 


(a)  (b) 

DFT  on  Multiple  SPEs  (Mem-mem) 

Performance  [pseudo  Gflop/s] 
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Figure  6.3:  Latency  optimized  parallel  DFTs  on  the  Cell  BE. 


robin  fashion  using  a  suitable  block  size.  We  achieve  close  to  80  Gflop/s  for  a  2^^ -point  DFT 
across  8  SPEs  using  a  block  cyclic  format,  and  about  53  Gflop/s  for  a  2^'^-point  DFT  with  a  stan¬ 
dard  distributed  format.  The  block  cyclic  data  distribution  requires  less  communication,  and 
thus  achieves  better  performance  compared  to  the  standard  distribution. 

DFT  on  multiple  SPEs,  data  in  main  memory.  Figure  6.3(c)  displays  the  performance  of 
parallelized  DFT  kernels  with  data  beginning  and  ending  in  main  memory.  We  compare  our 
generated  code  to  FFTW  [Frigo  and  Johnson,  2005]  and  FFTG  [Bader  and  Agarwal,  2007] ,  with 
performance  data  taken  from  [FFTW,  2008a]  and  [Bader  and  Agarwal,  2007] ,  respectively.  Note 
that  both  implementations  only  provide  latency-optimized  DFT  functions.  To  make  a  fair  com¬ 
parison  with  FFTW  and  FFTG,  we  specified  a  memO  tag,  which  simply  adds  DMA  load  and  store 
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operations  around  our  parallel  multi-SPE  kernels  (shown  in  Figure  6.3(d)).  This  approach  does 
not  result  in  highly  optimized  code  as  communication  and  computation  are  not  performed  in 
parallel.  Despite  this  limitation,  as  shown  in  Figure  6.3(f),  our  generated  code  already  outper¬ 
forms  FFTW  by  a  factor  of  4.5x,  and  FFTC  by  a  factor  of  1.63x  (for  size  2^^). 

FFTC  is  limited  in  its  effectiveness  because  it  uses  a  single  algorithm,  hard-coded  for  the  5 
problem  sizes  shown.  It  is  also  hard-coded  to  use  8  SPEs.  FFTW  has  been  ported  to  the  Cell  pro¬ 
cessor  [FFTW,  2008a],  and  is  able  to  offload  computation  to  the  SPEs.  However,  FFTW  is  unable 
to  achieve  more  than  22  (pseudo)  Gflop/s  of  performance  for  any  transform  size  on  the  Cell  pro¬ 
cessor.  Also,  we  speculate  that  it  is  unable  to  gain  an  advantage  by  offloading  computation  to  the 
SPEs  for  transform  sizes  that  are  less  than  2P-,  limiting  performance  to  under  5  (pseudo)  Gflop/s 
for  smaller  sizes. 

6.3.2  Clusters 

In  Figure  6.4(a),  (c),  and  (e),  we  show  the  performance  of  ID  DFTs  on  the  GrayXT4,  GrayXT4,  and 
the  Axon,  respectively.  In  Figure  6.4(b),  (d),  and  (f),  we  show  the  corresponding  performance  of 
FFTW.  Performance  scales  well  on  all  these  platforms.  Our  code  achieves  up  to  12.5  Gflop/s  on 
32  PEs  of  the  Gray  XT4  (FFTW  achieves  up  to  13.8  Gflop/s),  28.4  Gflop/s  on  128  PEs  of  the  Gray 
XT5  (EETW  achieves  26.9  Gflop/),  and  over  14  Gflop/s  on  64  PEs  of  the  Axon  (EETW  achieves  16.2 
Gflop/s). 

For  a  given  number  of  PEs  on  the  Gray  XT4  and  XT5,  we  time  our  DFTs  using  two  config¬ 
urations  with  respect  to  the  number  of  nodes  versus  the  number  of  cores:  one  maximizes  the 
number  of  nodes,  while  the  other  maximizes  the  number  of  cores.  Applications  may  demand  a 
particular  configuration.  On  the  XT4  and  the  XT5,  surprisingly,  maximizing  the  number  of  nodes 
consistently  yields  better  performance  than  maximizing  the  number  of  cores  for  a  given  number 
of  PEs. 

In  Figure  6.5(a)  and  (c),  we  show  the  performance  of  our  generated  2D  DFTs  on  the  Gray 
XT4  and  Gray  XT5.  We  show  the  corresponding  performance  of  FFTW  in  Figure  6.5(b)  and  (d). 
As  of  date,  we  were  unable  to  obtain  performance  results  for  the  2D  DFTs  on  the  Axon  due  to 
unavailability  of  the  machine.  Again,  the  performance  of  our  2D  FFTs  scales  well,  though  they 
are  slightly  slower  than  FFTW.  Notice  that  2D  FFTs  are  faster  than  the  ID  FFTs.  This  because  they 
require  only  two  communication  stages  as  opposed  to  the  three  stages  required  by  the  ID  DFTs. 
Note  that  in  some  cases,  we  were  unable  to  obtain  FFTW  performance  for  the  larger  number  of 
PEs:  this  was  because  the  EFTW  planner  took  longer  to  complete  its  planning  than  the  system 
allowed.  The  Spiral  planner  executes  on  a  single  node  (as  opposed  to  FETW’s  planner  which 
uses  the  entire  system),  which  allows  it  to  execute  without  demanding  the  use  of  all  the  PEs. 


6.4.  Latency  Optimized  DFTs  Streamed  from  Memory 
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6.4  Latency  Optimized  DFTs  Streamed  from  Memory 

We  evaluate  latency  optimized  DFTs  streamed  from  main  memory  on  the  Cell.  In  this  section, 
we  only  consider  sizes  that  are  too  large  to  fit  on  the  combined  local  stores  of  the  SPEs. 

Streamed  ID  DFTs,  single  SPE.  In  Figure  6.6(a),  we  show  the  performance  of  our  ID  DFTs 
streamed  using  our  two  stage  algorithm  and  the  three  stage  algorithm.  The  three  stage  algorithm 
is  faster  for  most  cases.  Given  that  we  were  unahle  to  accurately  predict  the  optimal  number  of 
stages  for  a  given  size,  we  search  over  both  our  algorithms  to  find  the  best  code.  Note  that  the 
performance  of  both  these  algorithms  drops  with  an  increase  in  problem  size.  This  is  because  of 
the  decrease  in  data  transfer  packet  size  as  the  problem  size  increases. 

Streamed,  parallelized  ID  DFTs.  In  Figure  6.6(b),  we  show  the  performance  of  our  large  ID 
DFTs  streamed  from  main  memory  and  parallelized  across  4  SPEs.  We  compare  our  performance 
with  [FFTW,  2008a]  and  [Sacco,  2008].  Although  the  performance  of  our  generated  code  is  com¬ 
parable  with  both,  we  are  unable  to  generate  code  for  sizes  larger  than  2^°  with  our  three  stage 
algorithm,  while  FFTW  can  compute  sizes  up  to  2^^. 

Streamed  2D  DFTs,  single  SPE.  In  Figure  6.6(c) ,  we  show  the  performance  of  small-to-medium- 
sized  2D  DFTs  on  a  single  SPE.  The  data  begins  and  ends  in  main  memory  and  includes  problem 
sizes  which  cannot  be  held  completely  in  the  local  store.  We  use  our  two  stage  streaming  al¬ 
gorithm.  Data  thus  has  to  make  two  round  trips  from  memory  to  the  local  stores.  We  observe 
performance  of  up  to  1 1  Gfiop  /  s  for  sizes  that  allow  for  reasonably  large  DMA  packet  sizes.  Sizes 
that  use  smaller  packet  sizes  see  lower  performance. 

Streamed,  parallelized  2D  DFTs.  Next,  in  Figure  6.6(d),  we  evaluate  the  performance  of 
streamed,  parallelized  2D  DFTs  for  a  range  of  sizes.  These  are  also  streamed  from  main  memory 
using  our  streaming  algorithms,  but  in  addition,  are  parallelized  across  4  SPEs.  The  performance 
of  our  code  is  again  comparable  to  FFTW. 


6.5  Throughput  Optimized  DFTs 

In  Figure  6.7 (a),  we  measure  the  performance  of  small  DFT  kernels  (data  resides  in  main  memory 
but  each  DFT  can  fit  in  the  local  store  of  one  SPE)  in  throughput  mode.  An  independent  copy 
of  the  single-SPE  kernel  runs  on  each  SPE.  Memory  bandwidth  is  the  limiting  factor  once  more 
than  4  SPEs  are  being  used,  and  we  see  sustained  performance  of  up  to  50  Gfiop /s. 

Figure  6.7(b)  is  a  throughput-optimized  version,  and  achieves  about  30  Gflop/s  on  a  DFT  of 
size  2^^  on  4  SPEs.  It  streams  data  from  main  memory  to  and  the  local  stores,  and  converts  data 
into  a  block  cyclic  data  format  on  the  fly,  enabling  the  application  of  our  fastest  parallel  block- 
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cyclic  multi-SPE  DPT  kernels.  This  technique  does  not  scale  well  to  8  SPEs  due  to  the  decrease 
in  the  size  of  the  DMA  packet  size  as  the  number  of  PEs  increases. 


6.5.  Throughput  Optimized  DFTs 
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I D  DFTs  on  the  Cray  XT4:  Spiral 
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I D  DFTs  on  the  Axon:  Spiral 

Performance  [pseudo  Gflop/s] 
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I D  DFTs  on  the  Axon:  FFTW 
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Figure  6.4:  Latency  optimized  parallel  DFTs  on  MPl  platforms.  A  16x2  in  the  legend  means  that  16  nodes 
with  2  cores  per  node  (total  of  32  PEs)  were  used. 
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2D  DFTs  on  the  Cray  XT4:  Spiral 
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Figure  6.5:  Latency  optimized  parallel  2D  DFTs  on  MPI  platforms.  Problem  sizes  are  shown  with  a  single 
dimension  on  the  x-axis  for  readability.  For  even  powers  of  two,  a  size  shown  as  2^  corresponds  to  an  input 
size  of  2”  x2”  where  n-  N 12.  For  odd  powers  of  two,  2^  corresponds  to  2”  x2”’''^.  For  example,  a  problem 
size  shown  as  2^®  refers  to  a  2D  DFT  size  of  2®  x  2®,  and  2^^’  refers  to  a  size  of  2®  x  2®.  A  16x2  in  the  legend 
means  that  16  nodes  with  with  2  cores  per  node  (total  of  32  PEs)  were  used. 
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Figure  6.6:  Latency  optimized  large  DFTs  streamed  from  main  memory  on  the  Cell  BE. 
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Figure  6.7:  Throughput  optimized  DFTs  on  the  Cell  BE. 
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Conclusion 


We  restate  from  our  introduction: 

The  goal  of  this  thesis  is  the  computer  generation  of  high-performance  DFT  libraries 
for  distributed  memory  parallel  processing  systems,  given  only  a  high-level  descrip¬ 
tion  of  DFT  algorithms  and  a  set  of  platform  and  architecture  parameters. 

In  this  thesis,  we  have  presented  a  prototypical  system  that  achieves  this  goal.  Writing  and 
tuning  DFT  libraries  for  distributed  memory  platforms  is  both  important  and  difficult,  and  the 
main  contribution  of  this  thesis  is  the  framework  to  formalize  and  automate  the  process  and  its 
actual  implementation  as  a  program  generator. 

As  we  discussed  in  Chapter  1,  performance  increases  in  computer  platforms  for  the  foresee¬ 
able  future  are  expected  to  come  primarily  from  the  scaling  of  parallelization.  The  distributed 
memory  paradigm  is  a  vital  component  in  scaling  parallelization,  which  makes  the  paradigm  a 
likely  mainstay  for  the  foreseeable  future. 

The  key  observation  that  shaped  our  approach  and  solution  is  that  communication  costs  can 
dominate  run  time  and  lead  to  poor  performance.  Thus,  it  is  important  to  design  algorithms  that 
explicitly  manage  data  transfer  and  minimize  communication  costs,  which  we  accomplished  by 
addressing  all  the  main  facets  of  the  problem.  First,  we  design  parallel  algorithms  that  accom¬ 
plish  data  transfer  using  only  packets  large  enough  to  minimize  overhead  costs  and  maximize 
achieved  bandwidth.  Second,  we  explicitly  manage  data  transfers  and  perform  them  concur¬ 
rently  with  computation  (this  requires  architectural  support),  thus  hiding  data  transfer  costs. 
Third,  when  acceptable  by  the  application,  we  present  data  distributions  that  can  reduce  com¬ 
munication  costs.  Finally,  we  show  traversals  of  the  DFT  dataflow  graph  that  can  maximize  the 
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performance  of  banked  memory  systems.  In  addition,  we  designed  our  techniques  to  be  com¬ 
patible  with  previous  work  that  target  optimizations  for  other  architectural  paradigms  including 
SIMD  vectorization. 

In  summary,  our  system  enables  the  computer  generation  of  high  performance  DPT  libraries 
for  current  and  future  platforms  that  incorporate  the  distributed  memory  design  at  the  chip  level 
or  the  node  level. 


7.1  Current  Limitations 

As  with  most  research  work,  our  system  has  several  limitations,  which  we  discuss  below. 

Option  space  limitations..  The  cross  product  of  all  options  and  scenarios  that  a  system  could 
generate  code  for  is  huge,  as  can  be  seen  in  Table  1.1.  We  do  not  support  this  entire  space,  in  part 
due  to  the  engineering  effort  involved. 

Problem  dimensions.  We  focus  on  ID  DFTs,  and  also  produce  code  for  2D  DFTs  in  this 
thesis,  but  do  not  consider  3D  DFTs.  Since  these  are  easily  expressed  in  SPL,  an  extension  of  our 
work  to  cover  3D  DFTs  should  be  straightforward. 

Problem  size  limitations.  In  this  thesis,  we  focus  on  two-power  sizes.  However,  our  tech¬ 
niques  should  be  easily  extensible  to  support  non-two  power  composite  sizes  composed  of  small 
(<  7)  primes  that  are  are  compatible  with  the  underlying  SIMD  vectorization  paradigm.  Vector¬ 
ization  requires  the  sizes  to  be  multiples  of  the  square  of  the  SIMD  vector  length.  This  would 
primarily  he  an  engineering  limitation.  Extension  to  other  sizes  including  larger  primes  may  re¬ 
quire  the  use  of  other  FFT  algorithms  in  addition  to  techniques  similar  to  the  ones  described  in 
[Franchetti  and  Piischel,  2007] . 

Other  programming  paradigms.  Although  we  did  not  target  programming  paradigms  such 
as  UPC  and  Chapel,  which  are  primarily  useful  for  applications  that  use  them,  we  do  demon¬ 
strate  that  our  system  works  with  at  least  two  different  programming  paradigms:  a  shared  mem¬ 
ory  pthreads  based  DMA  paradigm,  and  the  distributed  memory  MPI  paradigm.  Since  we  per¬ 
form  most  of  our  rewriting  at  a  level  where  the  programming  paradigm  itself  is  abstracted  out, 
our  work  should  be  extensible  to  generate  libraries  for  other  programming  paradigms  with  ap¬ 
propriate  changes  of  the  backend. 


7.2  Future  Directions 

In  addition  to  addressing  the  limitations  discussed  above,  we  discuss  possibilities  for  extending 
this  research  work. 


7.2.  Future  Directions 
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One  area  that  warrants  research  is  that  of  applying  our  parallelization  and  streaming  tech¬ 
niques  to  compute  DFTs  on  emerging  hybrid  architectures.  In  addition  to  a  distributed  memory 
design,  such  architectures  may  incorporate  multicore  parallelism,  SIMD  vectorization,  GPUs, 
and  FPGAs  at  each  node.  Further,  distributed  memory  designs  may  exist  at  multiple  levels  of  the 
parallelism  hierarchy  in  such  systems. 

Another  area  of  research  is  the  extension  of  our  work  to  transforms  other  than  the  DFT.  Al¬ 
though  we  focus  on  DFT  libraries,  our  approach  is  based  on  parallelizing  at  the  SPL  level,  and 
hence  should  be  applicable  to  other  linear  transforms  that  can  be  expressed  in  SPL. 

Based  on  the  recent  generalization  of  SPL  to  the  Operator  Language  (OL)  [de  Mesmay,  2010; 
Franchetti  et  al.,  2009a],  library  generation  for  linear  algebra  functionality  may  be  possible. 
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